stem95su: scheduled Drive->site sync CronJob (every 10m)
CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard + --max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault secret/stem95su. Requires the GCP OAuth app published to Production or the refresh token expires ~weekly. Lands the gdrive-sync stack on master (it had landed on a feature branch by accident on the shared devvm checkout). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
parent
05b50d2b96
commit
6d224861c4
1168 changed files with 120 additions and 358547 deletions
72
.beads/.gitignore
vendored
72
.beads/.gitignore
vendored
|
|
@ -1,72 +0,0 @@
|
|||
# Dolt database (managed by Dolt, not git)
|
||||
dolt/
|
||||
|
||||
# Runtime files
|
||||
bd.sock
|
||||
bd.sock.startlock
|
||||
sync-state.json
|
||||
last-touched
|
||||
.exclusive-lock
|
||||
|
||||
# Daemon runtime (lock, log, pid)
|
||||
daemon.*
|
||||
|
||||
# Interactions log (runtime, not versioned)
|
||||
interactions.jsonl
|
||||
|
||||
# Push state (runtime, per-machine)
|
||||
push-state.json
|
||||
|
||||
# Lock files (various runtime locks)
|
||||
*.lock
|
||||
|
||||
# Credential key (encryption key for federation peer auth — never commit)
|
||||
.beads-credential-key
|
||||
|
||||
# Local version tracking (prevents upgrade notification spam after git ops)
|
||||
.local_version
|
||||
|
||||
# Worktree redirect file (contains relative path to main repo's .beads/)
|
||||
# Must not be committed as paths would be wrong in other clones
|
||||
redirect
|
||||
|
||||
# Sync state (local-only, per-machine)
|
||||
# These files are machine-specific and should not be shared across clones
|
||||
.sync.lock
|
||||
export-state/
|
||||
export-state.json
|
||||
|
||||
# Ephemeral store (SQLite - wisps/molecules, intentionally not versioned)
|
||||
ephemeral.sqlite3
|
||||
ephemeral.sqlite3-journal
|
||||
ephemeral.sqlite3-wal
|
||||
ephemeral.sqlite3-shm
|
||||
|
||||
# Dolt server management (auto-started by bd)
|
||||
dolt-server.pid
|
||||
dolt-server.log
|
||||
dolt-server.lock
|
||||
dolt-server.port
|
||||
dolt-server.activity
|
||||
|
||||
# Corrupt backup directories (created by bd doctor --fix recovery)
|
||||
*.corrupt.backup/
|
||||
|
||||
# Backup data (auto-exported JSONL, local-only)
|
||||
backup/
|
||||
|
||||
# Per-project environment file (Dolt connection config, GH#2520)
|
||||
.env
|
||||
|
||||
# Legacy files (from pre-Dolt versions)
|
||||
*.db
|
||||
*.db?*
|
||||
*.db-journal
|
||||
*.db-wal
|
||||
*.db-shm
|
||||
db.sqlite
|
||||
bd.db
|
||||
# NOTE: Do NOT add negation patterns here.
|
||||
# They would override fork protection in .git/info/exclude.
|
||||
# Config files (metadata.json, config.yaml) are tracked by git by default
|
||||
# since no pattern above ignores them.
|
||||
|
|
@ -1,81 +0,0 @@
|
|||
# Beads - AI-Native Issue Tracking
|
||||
|
||||
Welcome to Beads! This repository uses **Beads** for issue tracking - a modern, AI-native tool designed to live directly in your codebase alongside your code.
|
||||
|
||||
## What is Beads?
|
||||
|
||||
Beads is issue tracking that lives in your repo, making it perfect for AI coding agents and developers who want their issues close to their code. No web UI required - everything works through the CLI and integrates seamlessly with git.
|
||||
|
||||
**Learn more:** [github.com/steveyegge/beads](https://github.com/steveyegge/beads)
|
||||
|
||||
## Quick Start
|
||||
|
||||
### Essential Commands
|
||||
|
||||
```bash
|
||||
# Create new issues
|
||||
bd create "Add user authentication"
|
||||
|
||||
# View all issues
|
||||
bd list
|
||||
|
||||
# View issue details
|
||||
bd show <issue-id>
|
||||
|
||||
# Update issue status
|
||||
bd update <issue-id> --claim
|
||||
bd update <issue-id> --status done
|
||||
|
||||
# Sync with Dolt remote
|
||||
bd dolt push
|
||||
```
|
||||
|
||||
### Working with Issues
|
||||
|
||||
Issues in Beads are:
|
||||
- **Git-native**: Stored in Dolt database with version control and branching
|
||||
- **AI-friendly**: CLI-first design works perfectly with AI coding agents
|
||||
- **Branch-aware**: Issues can follow your branch workflow
|
||||
- **Always in sync**: Auto-syncs with your commits
|
||||
|
||||
## Why Beads?
|
||||
|
||||
✨ **AI-Native Design**
|
||||
- Built specifically for AI-assisted development workflows
|
||||
- CLI-first interface works seamlessly with AI coding agents
|
||||
- No context switching to web UIs
|
||||
|
||||
🚀 **Developer Focused**
|
||||
- Issues live in your repo, right next to your code
|
||||
- Works offline, syncs when you push
|
||||
- Fast, lightweight, and stays out of your way
|
||||
|
||||
🔧 **Git Integration**
|
||||
- Automatic sync with git commits
|
||||
- Branch-aware issue tracking
|
||||
- Dolt-native three-way merge resolution
|
||||
|
||||
## Get Started with Beads
|
||||
|
||||
Try Beads in your own projects:
|
||||
|
||||
```bash
|
||||
# Install Beads
|
||||
curl -sSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/install.sh | bash
|
||||
|
||||
# Initialize in your repo
|
||||
bd init
|
||||
|
||||
# Create your first issue
|
||||
bd create "Try out Beads"
|
||||
```
|
||||
|
||||
## Learn More
|
||||
|
||||
- **Documentation**: [github.com/steveyegge/beads/docs](https://github.com/steveyegge/beads/tree/main/docs)
|
||||
- **Quick Start Guide**: Run `bd quickstart`
|
||||
- **Examples**: [github.com/steveyegge/beads/examples](https://github.com/steveyegge/beads/tree/main/examples)
|
||||
|
||||
---
|
||||
|
||||
*Beads: Issue tracking that moves at the speed of thought* ⚡
|
||||
|
|
@ -1,54 +0,0 @@
|
|||
# Beads Configuration File
|
||||
# This file configures default behavior for all bd commands in this repository
|
||||
# All settings can also be set via environment variables (BD_* prefix)
|
||||
# or overridden with command-line flags
|
||||
|
||||
# Issue prefix for this repository (used by bd init)
|
||||
# If not set, bd init will auto-detect from directory name
|
||||
# Example: issue-prefix: "myproject" creates issues like "myproject-1", "myproject-2", etc.
|
||||
# issue-prefix: ""
|
||||
|
||||
# Use no-db mode: JSONL-only, no Dolt database
|
||||
# When true, bd will use .beads/issues.jsonl as the source of truth
|
||||
# no-db: false
|
||||
|
||||
# Enable JSON output by default
|
||||
# json: false
|
||||
|
||||
# Feedback title formatting for mutating commands (create/update/close/dep/edit)
|
||||
# 0 = hide titles, N > 0 = truncate to N characters
|
||||
# output:
|
||||
# title-length: 255
|
||||
|
||||
# Default actor for audit trails (overridden by BEADS_ACTOR or --actor)
|
||||
# actor: ""
|
||||
|
||||
# Export events (audit trail) to .beads/events.jsonl on each flush/sync
|
||||
# When enabled, new events are appended incrementally using a high-water mark.
|
||||
# Use 'bd export --events' to trigger manually regardless of this setting.
|
||||
# events-export: false
|
||||
|
||||
# Multi-repo configuration (experimental - bd-307)
|
||||
# Allows hydrating from multiple repositories and routing writes to the correct database
|
||||
# repos:
|
||||
# primary: "." # Primary repo (where this database lives)
|
||||
# additional: # Additional repos to hydrate from (read-only)
|
||||
# - ~/beads-planning # Personal planning repo
|
||||
# - ~/work-planning # Work planning repo
|
||||
|
||||
# JSONL backup (periodic export for off-machine recovery)
|
||||
# Auto-enabled when a git remote exists. Override explicitly:
|
||||
# backup:
|
||||
# enabled: false # Disable auto-backup entirely
|
||||
# interval: 15m # Minimum time between auto-exports
|
||||
# git-push: false # Disable git push (export locally only)
|
||||
# git-repo: "" # Separate git repo for backups (default: project repo)
|
||||
|
||||
# Integration settings (access with 'bd config get/set')
|
||||
# These are stored in the database, not in this file:
|
||||
# - jira.url
|
||||
# - jira.project
|
||||
# - linear.url
|
||||
# - linear.api-key
|
||||
# - github.org
|
||||
# - github.repo
|
||||
|
|
@ -1,9 +0,0 @@
|
|||
{
|
||||
"database": "dolt",
|
||||
"backend": "dolt",
|
||||
"dolt_mode": "server",
|
||||
"dolt_server_host": "127.0.0.1",
|
||||
"dolt_server_port": 23209,
|
||||
"dolt_database": "in",
|
||||
"project_id": "ba61c0c3-3da2-4f4d-b63c-5ab6998943f1"
|
||||
}
|
||||
|
|
@ -1,326 +0,0 @@
|
|||
# Claude Code — Project Configuration
|
||||
|
||||
> **Shared knowledge**: Read `AGENTS.md` at repo root for architecture, patterns, rules, and operations. This file adds Claude-specific features on top.
|
||||
|
||||
## Claude-Specific Resources
|
||||
- **Skills**: `.claude/skills/` (7 active). Archived runbooks: `.claude/skills/archived/`
|
||||
- **Agents**: All agents are global (`~/.claude/agents/`, shared via dotfiles). Install Viktor's dotfiles for the full set.
|
||||
- **Infra specialists**: cluster-health-checker, dba, home-automation-engineer, network-engineer, observability-engineer, platform-engineer, security-engineer, sre
|
||||
- **Incident pipeline**: post-mortem → sev-triage → sev-historian → sev-report-writer
|
||||
- **DevOps**: devops-engineer, deploy-app, review-loop
|
||||
- **Reference**: `.claude/reference/` — patterns.md, service-catalog.md, proxmox-inventory.md, github-api.md, authentik-state.md
|
||||
- **GitHub API**: `curl` with tokens from tfvars (`gh` CLI blocked by sandbox)
|
||||
|
||||
## Critical Rule: Terraform Only
|
||||
|
||||
**ALL infrastructure changes MUST go through Terraform/Terragrunt.** Never use `kubectl apply/edit/patch/set`, `helm install/upgrade`, or any manual cluster mutation as the final state.
|
||||
|
||||
- **No exceptions for "quick fixes"** — even one-line changes must be in `.tf` files and applied via `scripts/tg apply`
|
||||
- **kubectl is for read-only operations and temporary debugging only** (get, describe, logs, exec, port-forward)
|
||||
- **If a resource isn't in Terraform yet**, evaluate whether it can be added before making manual changes. If manual change is unavoidable (e.g., emergency), document it immediately and create the Terraform resource in the same session
|
||||
- **kubectl scale/patch during migrations is acceptable** as a transient step, but the final state must be in Terraform and applied via `scripts/tg apply`
|
||||
- **Helm values live in Terraform** (templatefile or inline) — never `helm upgrade` directly
|
||||
|
||||
Violations cause state drift, which causes future applies to break or silently revert changes.
|
||||
|
||||
## Instructions
|
||||
- **"remember X"**: Use `memory-tool store "content" --category facts --tags "tag1,tag2"` (via exec) for persistent cross-session memory. Also update this file + `AGENTS.md` (if shared knowledge), commit with `[ci skip]`. To recall: `memory-tool recall "query"`. To list: `memory-tool list`. To delete: `memory-tool delete <id>`. The native `memory_search` and `memory_get` tools are also available for searching indexed memory files. For **storing** new memories, always use the `memory-tool` CLI via exec.
|
||||
- **Apply**: Authenticate via `vault login -method=oidc`, then use `scripts/tg` (preferred — handles state decrypt/encrypt) or `terragrunt` directly. `scripts/tg` adds `-auto-approve` for `--non-interactive` applies.
|
||||
- **New services need CI/CD** and **monitoring** (Prometheus/Uptime Kuma)
|
||||
- **New service**: Use `setup-project` skill for full workflow
|
||||
- **Ingress**: `ingress_factory` module. **Auth** (`auth` string enum, default `"required"` — fail-closed). Pick by asking "what gates the app?":
|
||||
- `auth = "required"` — Authentik forward-auth gates every request. Use when the backend has **no built-in user auth** and Authentik is the only thing standing between strangers and the app (prowlarr, qbittorrent, netbox, phpipam, k8s-dashboard, any admin UI shipped without its own login).
|
||||
- `auth = "app"` — the backend handles its own user authentication (NextAuth, Django, OAuth, bearer-token API, etc.); Authentik would only break it. No middleware attached; the app's own login is the gate. Examples: immich, linkwarden, tandoor, freshrss, affine, actualbudget, audiobookshelf, novelapp. **Functionally identical to `"none"`** — the distinct name exists to record intent at the call site.
|
||||
- `auth = "public"` — Authentik anonymous binding via the dedicated `public` outpost (routes via `traefik-authentik-forward-auth-public` → `ak-outpost-public.authentik.svc:9000`). Strangers auto-bound to `guest`; logged-in users keep their identity in `X-authentik-username`. **Only works for top-level browser navigation** — CORS preflight rejects XHR/fetch and automation can't replay the cookie dance. Audit trail, not a gate.
|
||||
- `auth = "none"` — no Authentik, no own-auth claim. Use for Anubis-fronted content (Anubis is the gate), native-client APIs (Git, `/v2/`, WebDAV/CalDAV, CardDAV), webhook receivers, OAuth callbacks, and Authentik outposts themselves.
|
||||
- **Anti-exposure rule** (the reason `"app"` exists): only pick `"app"` or `"none"` AFTER you've verified the app has its own user auth (`"app"`) OR the endpoint is intentionally public (`"none"`). Default is `"required"` so accidental omission fails closed. **Convention**: when using `"app"` or `"none"`, add a comment line above the `auth = "..."` line stating what gates the app or why it's public. **Enforced by `scripts/tg`**: every `tg plan/apply/destroy/refresh` runs `scripts/check-ingress-auth-comments.py` against the current stack and aborts if any `auth = "app|none"` line lacks the preceding `# auth = "<tier>": ...` comment. Stack-scoped — untouched stacks aren't blocked until they're next edited.
|
||||
- **Anti-AI**: on by default when `auth = "none"` or `auth = "app"` (no Authentik to discourage bots); redundant on `"required"` and `"public"`.
|
||||
- **DNS**: `dns_type = "proxied"` (Cloudflare CDN) or `"non-proxied"` (direct A/AAAA). DNS records are auto-created — no need to edit `config.tfvars`. Smoke-test target: `echo.viktorbarzin.me` (auth=public, header-reflecting backend).
|
||||
- **Anubis PoW challenge** (`modules/kubernetes/anubis_instance/`): per-site reverse proxy that issues a 30-day JWT cookie after a tiny PoW solve. Use for **public, content-bearing sites without app-level auth** (blog, docs, wikis, static landing pages). Pattern: declare `module "anubis" { source = "../../modules/kubernetes/anubis_instance"; name = "X"; namespace = ...; target_url = "http://<backend>.<ns>.svc.cluster.local" }`, then in `ingress_factory` set `service_name = module.anubis.service_name`, `port = module.anubis.service_port`, `anti_ai_scraping = false`. Shared ed25519 key in Vault `secret/viktor` -> `anubis_ed25519_key`; cookie scoped to `viktorbarzin.me` so one solve covers all Anubis-fronted subdomains. **DO NOT put Anubis in front of Git/API/WebDAV/CLI endpoints** — clients without JS can't solve PoW. **Replicas default to 1** because Anubis stores in-flight challenges in process memory; a challenge issued by pod A and solved against pod B errors with `store: key not found` (HTTP 500). Bumping replicas requires wiring a shared Redis store (TODO). For path-level carve-outs (e.g. wrongmove has `/` behind Anubis but `/api` direct, blog has `/net-diag.sh` direct), declare a second `ingress_factory` with `ingress_path = ["/<path>"]` pointing at the bare backend service. Active on: blog (except `/net-diag.sh`), www, kms, travel, f1, cc, json, pb (privatebin), home (homepage), wrongmove (UI only). See `.claude/reference/patterns.md` "Anti-AI Scraping" for full layering.
|
||||
- **Docker images**: Always build for `linux/amd64`. SHA-tag rule is being phased out — see `docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md`. New model: CI pushes `:latest` (optionally also `:<8-char-sha>` for traceability), Keel polls and triggers rollouts. Cache-staleness concern from the old rule is resolved at the nginx layer (URL-split — manifests pass through, blobs cached). Until Phase 1 of the migration completes (per the plan), follow the SHA-tag rule for new services to match existing pattern.
|
||||
- **Private registry**: `forgejo.viktorbarzin.me/viktor/<name>` (Forgejo packages, OAuth-style PAT auth). Use `image: forgejo.viktorbarzin.me/viktor/<name>:<tag>` + `imagePullSecrets: [{name: registry-credentials}]`. Kyverno auto-syncs the Secret to all namespaces. Containerd `hosts.toml` on every node redirects to in-cluster Traefik LB `10.0.20.203` (with `skip_verify = true`, since the node dials Traefik by IP but the cert is for `forgejo.viktorbarzin.me`) to avoid hairpin NAT. That redirect covers **kubelet pulls** only — in-cluster pods (notably Woodpecker buildkit build pods pushing images) resolve `forgejo.viktorbarzin.me` via a CoreDNS `rewrite name exact ... traefik.traefik.svc.cluster.local` (Corefile in `stacks/technitium/modules/technitium/main.tf`), since they do NOT use the node containerd mirror; without it, buildkit pushes intermittently timed out on the public-IP hairpin (added 2026-06-04, beads code-yh33). **Was `.200` until 2026-06-01** — Traefik's 2026-05-30 move to its dedicated `.203` left this redirect pointing at the now-dead `.200:443`, silently breaking every *fresh* forgejo pull (cached images kept running, so it stayed hidden until a new image tag was pulled). Redirect source lives in `modules/create-template-vm/k8s-node-containerd-setup.sh` (new nodes) and `scripts/setup-forgejo-containerd-mirror.sh` (existing nodes). Push-side: viktor PAT in Vault `secret/ci/global/forgejo_push_token` (Forgejo container packages are scoped per-user; only the package owner can push, ci-pusher cannot write to viktor/*). Pull-side: cluster-puller PAT in Vault `secret/viktor/forgejo_pull_token`. Retention CronJob (`forgejo-cleanup` in `forgejo` ns, daily 04:00) keeps newest 10 versions + always `:latest`; integrity probed every 15min by `forgejo-integrity-probe` in `monitoring` ns (catalog walk + manifest HEAD on every blob). See `docs/plans/2026-05-07-forgejo-registry-consolidation-{design,plan}.md` for the migration history. Pull-through caches for upstream registries (DockerHub, GHCR, Quay, k8s.gcr, Kyverno) stay on the registry VM at `10.0.20.10` ports 5000/5010/5020/5030/5040 — the old port-5050 R/W private registry was decommissioned 2026-05-07.
|
||||
- **LinuxServer.io containers**: `DOCKER_MODS` runs apt-get on every start — bake slow mods into a custom image (`RUN /docker-mods || true` then `ENV DOCKER_MODS=`). Set `NO_CHOWN=true` to skip recursive chown that hangs on NFS mounts.
|
||||
- **Node memory changes**: When changing VM memory on any k8s node, update kubelet `systemReserved`, `kubeReserved`, and eviction thresholds accordingly. Config: `/var/lib/kubelet/config.yaml`. Template: `stacks/infra/main.tf`. Current values: systemReserved=512Mi, kubeReserved=512Mi, evictionHard=500Mi, evictionSoft=1Gi.
|
||||
- **Node OS disk tuning** (in `stacks/infra/main.tf`): kubelet `imageGCHighThresholdPercent=70` (was 85), `imageGCLowThresholdPercent=60` (was 80), ext4 `commit=60` in fstab (was default 5s), journald `SystemMaxUse=200M` + `MaxRetentionSec=3day`.
|
||||
- **Sealed Secrets**: User-managed secrets go in `sealed-*.yaml` files in the stack directory. Stacks pick them up via `kubernetes_manifest` + `fileset(path.module, "sealed-*.yaml")`. See AGENTS.md for full workflow.
|
||||
- **CRITICAL — Update docs with every change**: When modifying infrastructure (Terraform, Vault, networking, storage, CI/CD, monitoring), you MUST update all affected documentation in the same commit. Check and update: `docs/architecture/*.md`, `docs/runbooks/*.md`, `.claude/CLAUDE.md`, `AGENTS.md`, `.claude/reference/service-catalog.md`. Stale docs cause incident response failures and onboarding confusion. If unsure which docs are affected, grep for the service/resource name across all doc files.
|
||||
|
||||
## Terraform State — Two-Tier Backend
|
||||
- **Tier 0 (bootstrap)**: Local state, SOPS-encrypted in git. Stacks: `infra`, `platform`, `cnpg`, `vault`, `dbaas`, `external-secrets`. These must exist before PG is reachable.
|
||||
- **Tier 1 (everything else)**: PostgreSQL backend (`pg`) on CNPG cluster at `pg-cluster-rw.dbaas.svc.cluster.local:5432/terraform_state`. Native `pg_advisory_lock` for concurrent safety. Each stack gets its own PG schema.
|
||||
- **Auth**: `scripts/tg` auto-fetches PG credentials from Vault (`database/static-creds/pg-terraform-state`). Humans use `vault login -method=oidc`, agents use K8s auth (role: `terraform-state`, namespace: `claude-agent`).
|
||||
- **Tier 0 workflow** (unchanged): `git pull` → `scripts/tg plan` → `scripts/tg apply` → `git push`. State sync via SOPS is transparent.
|
||||
- **Tier 1 workflow**: `vault login -method=oidc` → `scripts/tg plan` → `scripts/tg apply`. No git commit needed — PG is authoritative.
|
||||
- **Tier detection**: Defined in `terragrunt.hcl` (`locals.tier0_stacks`), `scripts/tg`, and `scripts/state-sync`. All three share the same list.
|
||||
- **Fallback**: If PG is down, Tier 0 local state can bring it back (`scripts/tg apply` in `dbaas` stack). Tier 1 ops are blocked until PG recovers.
|
||||
- **Tier 0 details**: Decrypt priority: Vault Transit (primary) → age key fallback. Encrypt: both Vault Transit + age recipients. Scripts: `scripts/state-sync {encrypt|decrypt|commit} [stack]`.
|
||||
- **Adding operator**: Generate age key (`age-keygen`), add pubkey to `.sops.yaml`, run `sops updatekeys` on Tier 0 `.enc` files. For Tier 1, only Vault access is needed.
|
||||
- **Migration script**: `scripts/migrate-state-to-pg` (one-shot, idempotent) migrates Tier 1 stacks from local to PG.
|
||||
- **Adopting existing resources**: use HCL `import {}` blocks (TF 1.5+), not `terraform import` CLI. Commit stanza → plan-to-zero → apply → delete stanza. Canonical reason: reviewable in PR, plan-safe, idempotent, tier-agnostic. Full rules + per-provider ID formats in `AGENTS.md` → "Adopting Existing Resources".
|
||||
|
||||
## Secrets Management — Vault KV
|
||||
- **Vault is the sole source of truth** for secrets.
|
||||
- **`secret/viktor`** — go-to path for ALL personal secrets (135 keys). Contains every API key, token, password, SSH key, and config from the old terraform.tfvars. Check here first: `vault kv get -field=KEY secret/viktor`.
|
||||
- **Auth**: `vault login -method=oidc` (Authentik SSO) → `~/.vault-token` → read by Vault TF provider.
|
||||
- **Vault stack self-reads**: `data "vault_kv_secret_v2" "vault"` reads its own OIDC creds from `secret/vault`.
|
||||
- **ESO (External Secrets Operator)**: `stacks/external-secrets/` — 43 ExternalSecrets + 9 DB-creds ExternalSecrets. API version `v1beta1`. Two ClusterSecretStores: `vault-kv` and `vault-database`.
|
||||
- **Plan-time pattern**: Former plan-time stacks use `data "kubernetes_secret"` to read ESO-created K8s Secrets at plan time (no Vault dependency). First-apply gotcha: must `terragrunt apply -target=kubernetes_manifest.external_secret` first, then full apply. `count` on resources using secret values fails — remove conditional counts.
|
||||
- **14 hybrid stacks** still keep `data "vault_kv_secret_v2"` for plan-time needs (job commands, Helm templatefile, module inputs). Platform has 48 plan-time refs — no migration possible without restructuring modules.
|
||||
- **Database rotation**: Vault DB engine rotates passwords every 7 days (604800s). MySQL: speedtest, wrongmove, codimd, nextcloud, shlink, grafana, phpipam. PostgreSQL: health, linkwarden, affine, woodpecker, claude_memory, crowdsec, technitium. Excluded: authentik (PgBouncer), root users. **Apps that read a rotated secret only at startup** (env var / initContainer, not a hot-reloaded mount) MUST carry a Reloader annotation (`secret.reloader.stakater.com/reload: <secret>`) or they keep the stale password and silently fail DB auth on each rotation until manually restarted — matrix's Synapse `inject-db-password` initContainer hit exactly this (found via Loki 2026-06-05, ~12.9k auth-fail lines/hr); matrix has since migrated to tuwunel (RocksDB, no Postgres) on 2026-06-08 and is no longer in the rotation list above. Technitium uses a password-sync CronJob (every 6h) to push rotated password to the Technitium app config via API, disable SQLite + MySQL logging, check PG plugin is loaded, configure PG query logging (90-day retention), and disable SQLite on secondary/tertiary instances.
|
||||
- **K8s credentials**: Vault K8s secrets engine. Roles: `dashboard-admin`, `ci-deployer`, `openclaw`, `local-admin`. Use `vault write kubernetes/creds/ROLE kubernetes_namespace=NS`. Helper: `scripts/vault-kubeconfig`.
|
||||
- **CI/CD (GHA + Woodpecker)**: Docker builds run on **GitHub Actions** (free on public repos). Woodpecker is **deploy-only** — receives image tag via API POST, runs `kubectl set image`. Woodpecker authenticates via K8s SA JWT → Vault K8s auth. Sync CronJob pushes `secret/ci/global` → Woodpecker API every 6h. Shell scripts in HCL heredocs: escape `$` → `$$`, `%{}` → `%%{}`.
|
||||
- **Platform cannot depend on vault** (circular). Apply order: vault first, then platform. Platform has 48 vault refs, all in module inputs — no ESO migration possible.
|
||||
- **Complex types** (maps/lists like `homepage_credentials`, `k8s_users`) stored as JSON strings in KV, decoded with `jsondecode()` in consuming stack `locals` blocks.
|
||||
- **New stacks**: Add secret in Vault UI/CLI at `secret/<stack-name>`, add ExternalSecret + `data "kubernetes_secret"` for plan-time, `secret_key_ref` for env vars. Use `data "vault_kv_secret_v2"` only if `data "kubernetes_secret"` won't work (e.g., first-apply bootstrap).
|
||||
- **Backup CronJob**: `vault-raft-backup` uses manually-created `vault-root-token` K8s Secret (independent of automation).
|
||||
- **Bootstrap (fresh cluster)**: Comment out data source + OIDC → apply Helm → init+unseal → populate `secret/vault` → uncomment → re-apply.
|
||||
|
||||
## Resource Management Patterns
|
||||
- **CPU**: All CPU limits removed cluster-wide (CFS throttling). Only set CPU requests based on actual usage.
|
||||
- **Memory**: Set explicit `requests=limits` based on VPA upperBound. Target: upperBound x 1.2 for stable services, x 1.3 for GPU/volatile workloads.
|
||||
- **VPA (Goldilocks)**: Must be `Initial` mode (not `Auto`) — Auto conflicts with Terraform's declarative resource management.
|
||||
- **LimitRange**: Tier-based defaults silently apply to pods with `resources: {}`. Always set explicit resources on containers needing more than defaults. Tier 3-edge and 4-aux now use Burstable QoS (request < limit) to reduce scheduler pressure.
|
||||
- **Democratic-CSI sidecars**: Must set explicit resources (32-80Mi) in Helm values — 17 sidecars default to 256Mi each via LimitRange. `csiProxy` is a TOP-LEVEL chart key, not nested under controller/node.
|
||||
- **ResourceQuota blocks rolling updates**: When quota is tight, scale to 0 then back to 1 instead of RollingUpdate. Or use Recreate strategy.
|
||||
- **Kyverno ndots drift**: Kyverno injects dns_config on all pods. Every `kubernetes_deployment`, `kubernetes_stateful_set`, and `kubernetes_cron_job_v1` MUST include `lifecycle { ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1 }` (use `spec[0].job_template[0].spec[0].template[0].spec[0].dns_config` for CronJobs). The `# KYVERNO_LIFECYCLE_V1` marker is the canonical discoverability tag — grep for it to locate every site. A shared Terraform module was considered but `ignore_changes` only accepts static attribute paths (not module outputs, locals, or expressions), so the snippet convention is the only viable path. Full rationale and copy-paste snippets in `AGENTS.md` → "Kyverno Drift Suppression".
|
||||
- **NVIDIA GPU operator resources**: dcgm-exporter and cuda-validator resources configurable via `dcgmExporter.resources` and `validator.resources` in nvidia values.yaml.
|
||||
- **Pin database versions**: Disable Diun (image update monitoring) for MySQL, PostgreSQL, Redis.
|
||||
- **Quarterly right-sizing**: Check Goldilocks dashboard. Compare VPA upperBound to current request. Also check for under-provisioned (VPA upper > request x 0.8).
|
||||
|
||||
## CI/CD Architecture — GHA Builds + Woodpecker Deploy
|
||||
|
||||
**Owned-app deploy model (build triggers the rollout — 2026-06-02):** For
|
||||
self-hosted apps **we build** (Forgejo `viktor/<name>` + Dockerfile +
|
||||
`.woodpecker.yml`), the build pipeline ALSO drives the rollout — atomic +
|
||||
deterministic, no wait for Keel's poll. Pattern (`build-and-push` tags `latest`
|
||||
+ `${CI_COMMIT_SHA:0:8}`, then a `deploy` step): `kubectl set image
|
||||
deployment/<app> <container>=<repo>:${CI_COMMIT_SHA:0:8} -n <ns>` +
|
||||
`kubectl rollout status ... --timeout=300s`. The `woodpecker-agent` SA is
|
||||
`cluster-admin`, so the `bitnami/kubectl` step needs no kubeconfig/RBAC (uses
|
||||
its in-cluster SA). **Keel stays enrolled in parallel** as a redundant net
|
||||
(finds the deployed SHA already running → no-op). Requires the Deployment to
|
||||
have `ignore_changes` on `…container[0].image` (KEEL_IGNORE_IMAGE) so CI
|
||||
`set image` doesn't fight `terragrunt apply`. CronJobs in owned apps use
|
||||
`:latest` + `imagePullPolicy: Always` (fresh pod each run) instead of a deploy
|
||||
step. **Never** `set image`/`rollout restart` operator-managed StatefulSets
|
||||
(memory id=740). Reference impls: `tuya_bridge/.woodpecker.yml`,
|
||||
`job-hunter`, `f1-stream` (viktor/f1-stream, extracted from this monorepo
|
||||
2026-06-05). This reverses decision #12 of
|
||||
`docs/plans/2026-05-16-auto-upgrade-apps-design.md` for owned (not upstream)
|
||||
images.
|
||||
|
||||
**Flow (GHA-migrated apps)**: `git push → GHA build+push DockerHub (8-char SHA) → POST Woodpecker API → kubectl set image`
|
||||
|
||||
**Migrated to GHA** (9): Website, k8s-portal, claude-memory-mcp, apple-health-data, audiblez-web, plotting-book, insta2spotify, audiobook-search, council-complaints
|
||||
**Woodpecker-native owned-app build** (Forgejo registry, build->deploy in one `.woodpecker.yml`): tuya_bridge, job-hunter, f1-stream (extracted to viktor/f1-stream 2026-06-05; Woodpecker repo id 166; the old github source is archived + its GHA repo-id-10 deactivated)
|
||||
**Woodpecker-only**: travel_blog (1.4GB content too large for GHA), infra pipelines (terragrunt apply, certbot, build-cli — need cluster access)
|
||||
|
||||
**Per-project files**:
|
||||
- `.github/workflows/build-and-deploy.yml` — GHA: checkout, build, push DockerHub, POST Woodpecker API
|
||||
- `.woodpecker/deploy.yml` — Woodpecker: `kubectl set image` + Slack notify (event: `[manual, push]`)
|
||||
- `.woodpecker/build-fallback.yml` — Old full build pipeline preserved (event: `deployment` — never auto-fires)
|
||||
|
||||
**Woodpecker API**: Uses **numeric repo IDs** (`/api/repos/2/pipelines`), NOT owner/name paths (those return HTML).
|
||||
Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handler=6, audiblez-web=9, plotting-book=43, claude-memory-mcp=78, infra-onboarding=79, council-complaints=TBD (f1-stream's old GHA-era github repo id 10 is deactivated; it's now a Woodpecker-native Forgejo build at repo id 166)
|
||||
|
||||
**Woodpecker YAML gotchas**:
|
||||
- Commands with `${VAR}:${VAR}` must be **quoted** — unquoted `:` triggers YAML map parsing when vars are empty
|
||||
- Use `bitnami/kubectl:latest` (not pinned versions — entrypoint compatibility issues)
|
||||
- Global secrets must have `manual` in their events list for API-triggered pipelines
|
||||
|
||||
**GitHub repo secrets** (set on all repos): `DOCKERHUB_USERNAME`, `DOCKERHUB_TOKEN`, `WOODPECKER_TOKEN`
|
||||
|
||||
**Infra pipelines unchanged**: `default.yml` (terragrunt apply), `renew-tls.yml` (certbot cron), `build-cli.yml` (dual registry push), `k8s-portal.yml` (path-filtered build), `provision-user.yml` — all stay on Woodpecker.
|
||||
|
||||
## Database Host
|
||||
|
||||
**`postgresql_host`** in `config.tfvars` is `pg-cluster-rw.dbaas.svc.cluster.local` (the CNPG primary). The legacy `postgresql.dbaas` service has no endpoints — never use it. This variable is shared by ~12 stacks.
|
||||
|
||||
**CNPG tuning** (in `stacks/dbaas/modules/dbaas/main.tf`): `shared_buffers=512MB`, `work_mem=16MB`, `wal_compression=on`, `effective_cache_size=1536MB`, pod memory 2Gi.
|
||||
|
||||
## Networking & Resilience
|
||||
- **Critical path services scaled to 3**: Traefik, Authentik, CrowdSec LAPI, PgBouncer, Cloudflared.
|
||||
- **PDBs**: minAvailable=2 on Traefik and Authentik.
|
||||
- **Fallback proxies**: basicAuth when Authentik is down, fail-open when poison-fountain is down.
|
||||
- **CrowdSec bouncer**: graceful degradation mode (fail-open on error).
|
||||
- **Rate limiting**: Return 429 (not 503). Per-service tuning: Immich/Nextcloud need higher limits.
|
||||
- **Retry middleware**: 2 attempts, 100ms — in default ingress chain.
|
||||
- **Entrypoint transport timeouts** (`websecure` `respondingTimeouts`): `writeTimeout=0` (unlimited download duration), `readTimeout=3600s` (uploads ≤1h), `idleTimeout=600s`. These are **HARD total-duration caps**, not nginx-style per-read idle timeouts — a finite `writeTimeout` truncates *any* large download at that wall-clock mark (a prior `writeTimeout=60s` silently cut Immich videos at 60s). **Do NOT re-tighten `writeTimeout`**; keep `readTimeout` finite (slow-loris backstop) but ≥ longest expected upload. Full rationale: `docs/architecture/networking.md` → "Entrypoint Transport Timeouts".
|
||||
- **HTTP/3 (QUIC)**: Enabled on Traefik. Works for **direct (non-proxied) apps** via the dedicated LB IP below (ETP=Local). Proxied apps get QUIC at the Cloudflare edge.
|
||||
- **Traefik LB IP = `10.0.20.203`, `externalTrafficPolicy: Local`** (dedicated, NOT the shared `.200`). Moved off the shared `.200` on 2026-05-30 so direct/non-proxied apps preserve the **real client IP for CrowdSec** (ETP=Cluster SNAT'd them to the node IP) and so QUIC works. **The shared `10.0.20.200` keeps the other 10 LB services** (PG state-backend `postgresql-lb`, headscale, wireguard, coturn, xray, etc. — all ETP=Cluster; MetalLB forbids mixed ETP on a shared IP, hence Traefik's own IP). **cloudflared targets the in-cluster Traefik Service** (`https://traefik.traefik.svc.cluster.local:443`, remote/dashboard tunnel config — edit via CF Global API Key in `secret/platform`), so proxied apps are decoupled from the LB IP. pfSense WAN 443 (tcp+udp) NAT → alias `traefik_lb` (`.203`). Internal split-horizon apex `viktorbarzin.me A` → `.203`. Full runbook + post-mortem: `docs/plans/2026-05-30-traefik-dedicated-ip-etp-local-*`.
|
||||
- **IPv6 ingress** = HE 6in4 tunnel (`2001:470:6e:43d::2`) → **standalone HAProxy on pfSense** (`/usr/local/etc/ipv6-haproxy.cfg`, NOT the HAProxy package) using `send-proxy-v2` → Traefik `.203` (web 443/80) + mail NodePorts `30125-30128` (25/465/587/993) — so **real IPv6 client IPs reach CrowdSec**. Traefik trusts PROXY-v2 **only from `10.0.20.1`** (`entryPoints.web/websecure.proxyProtocol.trustedIPs`); real IPv4 clients (own source IP) unaffected. **No QUIC over IPv6** (bridge is TCP/h2). Replaced socat 2026-05-30 (socat masked every v6 client as `10.0.20.1`). Boot/persistence: config.xml `<shellcmd>` → `ipv6_proxy.sh` (patches nginx off `[::]:443/:80` to free the tunnel IPv6, then `service ipv6proxy onestart`); `rc.d/ipv6proxy` manages HAProxy. Backends use **no health `check`** (a plain TCP check false-DOWNs the PROXY-expecting listeners). As-built: `docs/architecture/networking.md` → "IPv6 Ingress".
|
||||
- **IPAM & DNS auto-registration**: pfSense Kea DHCP serves all 3 subnets (VLAN 10, VLAN 20, 192.168.1.x). Kea DDNS auto-registers every DHCP client in Technitium (RFC 2136, A+PTR). CronJob `phpipam-pfsense-import` (hourly) pulls Kea leases + ARP into phpIPAM via SSH (passive, no scanning). CronJob `phpipam-dns-sync` (15min) bidirectional sync phpIPAM ↔ Technitium. 42 MAC reservations for 192.168.1.x.
|
||||
|
||||
## Service-Specific Notes
|
||||
| Service | Key Operational Knowledge |
|
||||
|---------|--------------------------|
|
||||
| Nextcloud | MaxRequestWorkers=150, needs 8Gi limit (Apache transient memory spikes, see commit eb94144), very generous startup probe |
|
||||
| Immich | ML on SSD (CUDA), disable ModSecurity (breaks streaming), frequent upgrades. **`immich-machine-learning` MUST run with `MACHINE_LEARNING_MODEL_TTL > 0`** (set to `600` in `stacks/immich/main.tf`, env on the `immich-machine-learning` deployment). At `0`, no model ever unloads and onnxruntime's CUDA arena (OCR's dynamic input shapes inflate it to ~10 GB) is held forever on the **time-sliced T4 it shares with llama-swap/frigate/immich-server** — which has no VRAM isolation, so immich-ml starved llama-swap (qwen3-8b) and silently broke recruiter-responder triage for ~5 h on 2026-06-02 (post-mortem `docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`). TTL>0 lets idle models (OCR, face — AND CLIP) free VRAM. The TTL is a single GLOBAL knob (no per-model pin), so CLIP would also unload after 600s idle; the `clip-keepalive` CronJob (`*/5 * * * *`, same stack) pings the CLIP textual encoder so smart-search stays warm without pinning the ad-hoc models. **Smart search has a SECOND warmth layer in Postgres** (don't conflate it with the ML model): the ~665MB vchord `clip_index` must stay resident in PG `shared_buffers`, else an ANN probe that lands on an evicted list pays a ~1.8s cold storage read vs ~4ms warm. The `postStart` hook prewarms it ONCE at pod start and `pg_prewarm.autoprewarm` only re-warms at *startup*, so the index decays out of cache over days under job buffer-pressure (observed ~33% resident after 9d uptime → slow context search, easily misattributed to the ML model). The `clip-index-prewarm` CronJob (`*/5`, same stack) re-runs `pg_prewarm('clip_index')` to pin it hot; `immich-search-probe` (`*/5`) measures live latency + residency → Pushgateway gauges (`immich_smart_search_db_seconds`, `immich_clip_index_cached_pct`) → alerts `ImmichSmartSearchSlow`/`ImmichClipIndexColdCache`/`ImmichSearchProbeStale` + cluster-health check #46 (`check_immich_search`). immich PG role is a superuser so the CronJobs can run `pg_prewarm`/`pg_buffercache`. **Video transcoding is GPU-accelerated**: `immich-server` is pinned to GPU node1 (nodeSelector `nvidia.com/gpu.present` + NoSchedule toleration + `gpu-workload` priority) with a time-sliced `nvidia.com/gpu=1` slice — the stock immich-server image's ffmpeg already ships h264/hevc_nvenc + NVDEC. Activated via `ffmpeg.accel=nvenc` + `accelDecode=true` in the **DB** system-config (`system_metadata` table, key `system-config`, JSONB — NOT Terraform; app config is DB-managed here like oauth/smtp). Direct DB edits need a pod **recreate** to reload (config is cached at boot; only API-driven changes broadcast a reload). **Streaming bitrate is capped** to keep 4K playback smooth on the contended HDD and over remote uplinks: `ffmpeg.maxBitrate=20000k` + `preset=medium` + `transcode=bitrate` (set 2026-06-01 — was uncapped `maxBitrate=0` + `ultrafast` + `targetResolution=original`, which produced 77–264 Mbps 4K transcodes that stuttered for every client, local and remote, since even a single stream needs ~10–13.5 MB/s off the shared `sdc` spindle). 4K resolution is preserved (`targetResolution=original`); originals are NEVER modified — only the `encoded-video/` streaming copy. To re-apply transcode settings to EXISTING videos (config changes only affect new/missing ones): delete the offenders' `asset_file` rows `WHERE type='encoded_video'` (derived/regenerable — never touches originals) then run videoConversion `force=false` (admin Jobs API → "Missing"); it regenerates them to the deterministic `<assetId>.mp4` path at concurrency 1 (gentle on sdc). See `docs/runbooks/immich-transcode-bitrate.md`. If Immich is ever reinstalled fresh (not restored), re-set these keys (accel, accelDecode, **maxBitrate=20000k, preset=medium, transcode=bitrate**). Thumbnails/previews live on SSD NFS (sdb) — do NOT move to block storage (HDD sdc = slower + the contended IO domain). **Background-job concurrency is capped to protect sdc** (DB-managed system-config, `system_metadata` key `system-config`, JSONB `job.*.concurrency`; re-set on fresh install): `thumbnailGeneration=2`, `metadataExtraction=2`, `library=2` — these jobs read ORIGINALS off the HDD library. Left uncapped (were 8/4/4) a library-wide job (e.g. Duplicate Detection on 2026-06-01) fans the ML/thumbnail backfill out into a read storm that saturates sdc and starves etcd → apiserver down. `sidecar`/`smartSearch`/`faceDetection` stay at Immich defaults (small `.xmp` / SSD previews). Apply via Job Settings UI or the `system-config` API; **direct DB edits need an `immich-server` pod recreate to reload** (config cached at boot). See `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md`. |
|
||||
| CrowdSec | Pin version, disable Metabase when not needed (CPU hog), LAPI scaled to 3, **DB on PostgreSQL** (migrated from MySQL), flush config: max_items=10000/max_age=7d/agents_autodelete=30d, DECISION_DURATION=168h in blocklist CronJob |
|
||||
| Frigate | GPU stall detection in liveness probe (inference speed check), high CPU |
|
||||
| Authentik | 3 replicas, PgBouncer in front of PostgreSQL, strip auth headers before forwarding |
|
||||
| Kyverno | failurePolicy=Ignore to prevent blocking cluster, pin chart version |
|
||||
| MySQL Standalone | Raw `kubernetes_stateful_set_v1` pinned to `mysql:8.4.8` exactly (migrated from InnoDB Cluster 2026-04-16; **pinned to 8.4.8 on 2026-05-18** after Keel-driven `mysql:8.4` → 8.4.9 bump stalled the DD upgrade and required a full PVC-wipe + dump-restore — see `docs/runbooks/restore-mysql.md` and beads code-eme8/code-k40p). `skip-log-bin`, `innodb_flush_log_at_trx_commit=2`, `innodb_doublewrite=ON`. ConfigMap `mysql-standalone-cnf`. PVC `data-mysql-standalone-0` (5Gi initial → 30Gi via autoresizer, `proxmox-lvm-encrypted`). Service `mysql.dbaas` unchanged. Anti-affinity excludes k8s-node1. Bitnami charts deprecated (Broadcom Aug 2025) — use official images. |
|
||||
| phpIPAM | IPAM — no active scanning. `pfsense-import` CronJob (hourly) pulls Kea leases + ARP via SSH. `dns-sync` CronJob (15min) bidirectional sync with Technitium. Kea DDNS on pfSense handles all 3 subnets. API app `claude` (ssl_token). |
|
||||
|
||||
## Monitoring & Alerting
|
||||
- Alert cascade inhibitions: if node is down, suppress pod alerts on that node.
|
||||
- Exclude completed CronJob pods from "pod not ready" alerts.
|
||||
- Every new service gets Prometheus scrape config + Uptime Kuma monitor. External monitors auto-created for Cloudflare-proxied services by `external-monitor-sync` CronJob (10min, uptime-kuma ns). Mechanism: `ingress_factory` auto-adds `uptime.viktorbarzin.me/external-monitor=true` whenever `dns_type != "none"` (see `modules/kubernetes/ingress_factory/main.tf`) — no manual action needed on new services. The `cloudflare_proxied_names` list in `config.tfvars` is a legacy fallback for the 17 hostnames not yet migrated to `ingress_factory` `dns_type`; don't check that list when debugging "is this monitored?" questions.
|
||||
- **External monitoring**: `[External] <service>` monitors in Uptime Kuma test full external path (DNS → Cloudflare → Tunnel → Traefik). Divergence metric `external_internal_divergence_count` → alert `ExternalAccessDivergence` (15min). Config: `stacks/uptime-kuma/`, targets from `cloudflare_proxied_names` in `config.tfvars` (17 remaining centrally-managed hostnames; most DNS records now auto-created by `ingress_factory` `dns_type` param).
|
||||
- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence, ImmichSmartSearchSlow (context-search latency / clip_index cache eviction).
|
||||
- **E2E email monitoring**: CronJob `email-roundtrip-monitor` (every 20 min) sends test email via Brevo HTTP API to `smoke-test@viktorbarzin.me` (catch-all → `spam@`), verifies IMAP delivery, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Alerts: `EmailRoundtripFailing` (60m), `EmailRoundtripStale` (60m), `EmailRoundtripNeverRun` (60m). Outbound relay: Brevo EU (`smtp-relay.brevo.com:587`, 300/day free — migrated from Mailgun). Inbound external traffic enters via pfSense HAProxy on `10.0.20.1:{25,465,587,993}`, which forwards to k8s `mailserver-proxy` NodePort (30125-30128) with `send-proxy-v2`. Mailserver pod runs alt PROXY-speaking listeners (2525/4465/5587/10993) alongside stock PROXY-free ones (25/465/587/993) for intra-cluster clients. Real client IPs recovered from PROXY v2 header despite kube-proxy SNAT (replaces pre-2026-04-19 MetalLB `10.0.20.202` ETP:Local scheme; see bd code-yiu + `docs/runbooks/mailserver-pfsense-haproxy.md`). Vault: `brevo_api_key` in `secret/viktor` (probe + relay).
|
||||
- **Authentik walling-off guard**: `blackbox-exporter` (monitoring ns, `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf`) probes each must-stay-public `auth = "none"` carve-out URL with `no_follow_redirects` and FAILS (`fail_if_header_matches` on `Location`) iff it 302s to Authentik. Catches a carve-out regressing (TF revert / deploy / `ingress_factory` `auth` default flipping back to `"required"`). Scrape job `blackbox-authentik-walloff` (1m) → alert `AuthentikWallingOffPublicPath` (`probe_failed_due_to_regex == 1`, for 10m, `lane=security` → `#security` Slack). **To guard a new carve-out: add one line to `local.authentik_walloff_targets`** (a `service → URL` map; `valid_status_codes` includes 301/302 so legit redirects/404s stay green — only the Authentik `Location` fails the probe). `curl -sI '<url>'` must NOT show a Location to `authentik.viktorbarzin.me` before adding.
|
||||
|
||||
## Security Posture (Wave 1 — locked 2026-05-18)
|
||||
|
||||
Plan in `docs/architecture/security.md` + response playbook in `docs/runbooks/security-incident.md`. Beads epic: `code-8ywc`.
|
||||
|
||||
- **Identity allowlist for security rules**: ONLY `me@viktorbarzin.me`. NOT `viktor@viktorbarzin.me`, NOT `emo@viktorbarzin.me` (those don't exist). emo's identity scheme is unknown — ask before assuming.
|
||||
- **Source-IP allowlist (K2, K9, V7, S1)**: `10.0.20.0/22`, `192.168.1.0/24` (Proxmox + Sofia LAN), K8s pod CIDR, K8s service CIDR, Headscale tailnet. **Policy: no public-IP access** — Vault, kube-apiserver, PVE sshd must transit LAN or Headscale.
|
||||
- **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → `#security` Slack receiver. Single channel with severity labels inside (critical/warning/info). No paging.
|
||||
- **Kyverno policies (wave 1)**: `deny-privileged-containers`, `deny-host-namespaces`, `restrict-sys-admin`, `require-trusted-registries` flip Audit→Enforce with the 31-namespace exclude list (memory id=1970). `failurePolicy: Ignore` preserved. Cosign `verify-images` deferred.
|
||||
- **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred.
|
||||
- **What's NOT in scope**: canary tokens (rejected — self-trigger risk with Viktor's normal `vault kv list secret/viktor` and `kubectl get secret -A` workflows), Falco/Tetragon (too noisy for Slack-only daily check), Cloudflare/GitHub audit polling (deferred to wave 2).
|
||||
|
||||
## Storage & Backup Architecture
|
||||
|
||||
### Storage Class Decision Rule (for new services)
|
||||
|
||||
Choose storage class based on workload type:
|
||||
|
||||
| Use **proxmox-lvm-encrypted** when | Use **proxmox-lvm** when | Use **NFS** (`nfs_volume` module) when |
|
||||
|------------------------------------|--------------------------|----------------------------------------|
|
||||
| **Any service storing sensitive data** | Non-sensitive app state (configs, caches) | Shared data across multiple pods (RWX) |
|
||||
| Databases (user data, credentials) | Media indexes, search caches | Media libraries (music, ebooks, photos) |
|
||||
| Auth/identity services | Monitoring data (Prometheus) | Backup destinations (cloud sync picks up from NFS) |
|
||||
| Password managers, email, git repos | Tools with no user secrets | Large datasets (>10Gi) where snapshots matter |
|
||||
| Health/financial data | | Data you want to browse/inspect from outside k8s |
|
||||
|
||||
**Default for sensitive data is proxmox-lvm-encrypted.** Use plain `proxmox-lvm` only for non-sensitive workloads. Use NFS when you need RWX, backup pipeline integration, or it's a large shared media library.
|
||||
|
||||
**NFS server:**
|
||||
- **Proxmox host** (192.168.1.127): Sole NFS for all workloads. HDD at `/srv/nfs` (ext4 thin LV `pve/nfs-data`, 3 TB). SSD at `/srv/nfs-ssd` (ext4 LV `ssd/nfs-ssd-data`, 100GB). Exports use `async,insecure` options (`async` — safe with UPS + Vault Raft replication + databases on block storage; `insecure` — pfSense NATs source ports >1024 between VLANs).
|
||||
- **Nextcloud as NFS browser**: Nextcloud (`nextcloud.viktorbarzin.me`) mounts the PVE NFS roots (`/srv/nfs`, `/srv/nfs-ssd`) inside the NC pod at `/mnt/pve-nfs` + `/mnt/pve-nfs-ssd`. Surfaced to users via two ACL patterns: (1) admin-only root browsers `PVE NFS Pool` + `PVE NFS-SSD Pool` (scoped to NC group `admin`); (2) per-archive mounts (e.g. `/anca-elements`) with `applicable_users` set to the owners. ACL is at the mount level via `occ files_external:applicable` — Files Access Control is NOT used (NC 30/31's workflow engine lacks FilePath / UserId checks). Manifest lives in `kubernetes_config_map_v1.nextcloud_external_storage_manifest` (`stacks/nextcloud/external_storage.tf`); a one-shot K8s Job applies it idempotently.
|
||||
- **`nfs-truenas` StorageClass**: Historical name retained only because SC names are immutable on PVs (48 bound PVs reference it — renaming would require mass PV churn, not worth it). Now points to the Proxmox host (`nfs.csi.k8s.io` dynamic provisioning on `192.168.1.127:/srv/nfs`). TrueNAS (VM 9000, 10.0.10.15) operationally decommissioned 2026-04-13; VM still exists in stopped state on PVE pending user decision on deletion.
|
||||
|
||||
**Migration note**: CSI PV `volumeAttributes` are immutable — cannot update NFS server in place. New PV/PVC pairs required (convention: append `-host` to PV name).
|
||||
|
||||
**NFS CSI mount option requirements** (learned from [PM-2026-04-14]):
|
||||
- **ALWAYS set `nfsvers=4`** in CSI mount options. NFSv3 is disabled on the PVE host (`vers3=n` in `/etc/nfs.conf`). Without this, mounts fail silently if kernel NFS client state is corrupt.
|
||||
- **NEVER use `fsid=0`** in `/etc/exports` on `/srv/nfs`. `fsid=0` designates the NFSv4 pseudo-root, which breaks subdirectory path resolution for all CSI mounts. Only `fsid=1` (unique ID) is safe on `/srv/nfs-ssd`.
|
||||
- **`/etc/exports` is git-managed** at `infra/scripts/pve-nfs-exports`. Deploy: `scp scripts/pve-nfs-exports root@192.168.1.127:/etc/exports && ssh root@192.168.1.127 exportfs -ra`
|
||||
- **Critical services MUST NOT use NFS storage** — circular dependency risk. Alertmanager, Prometheus, and any monitoring that should alert about NFS must use `proxmox-lvm-encrypted`. Technitium DNS primary uses `proxmox-lvm-encrypted` (migrated 2026-04-14).
|
||||
- **NFS PV template** (in `modules/kubernetes/nfs_volume/`): always include `mountOptions: ["nfsvers=4", "soft", "actimeo=5", "retrans=3", "timeo=30"]`
|
||||
|
||||
**proxmox-lvm PVC template** (Terraform):
|
||||
```hcl
|
||||
resource "kubernetes_persistent_volume_claim" "data_proxmox" {
|
||||
wait_until_bound = false
|
||||
metadata {
|
||||
name = "<service>-data-proxmox"
|
||||
namespace = kubernetes_namespace.<ns>.metadata[0].name
|
||||
annotations = {
|
||||
"resize.topolvm.io/threshold" = "10%"
|
||||
"resize.topolvm.io/increase" = "100%"
|
||||
"resize.topolvm.io/storage_limit" = "5Gi"
|
||||
}
|
||||
}
|
||||
spec {
|
||||
access_modes = ["ReadWriteOnce"]
|
||||
storage_class_name = "proxmox-lvm"
|
||||
resources {
|
||||
requests = { storage = "1Gi" }
|
||||
}
|
||||
}
|
||||
lifecycle {
|
||||
# pvc-autoresizer expands this PVC up to storage_limit; ignore drift on
|
||||
# requests.storage so the next TF apply doesn't try to shrink it back
|
||||
# (K8s rejects shrinks → apply fails). To bump the floor manually:
|
||||
# temporarily remove this block, apply the new size, re-add the block,
|
||||
# apply again.
|
||||
ignore_changes = [spec[0].resources[0].requests]
|
||||
}
|
||||
}
|
||||
```
|
||||
- `wait_until_bound = false` is **required** (WaitForFirstConsumer binding)
|
||||
- Deployment strategy **must be Recreate** (RWO volumes)
|
||||
- Autoresizer annotations are **required** on all proxmox-lvm PVCs
|
||||
- `lifecycle.ignore_changes` on `requests` is **required** to coexist with the autoresizer
|
||||
- Every proxmox-lvm app **MUST** add a backup CronJob writing to NFS `/mnt/main/<app>-backup/`
|
||||
|
||||
**proxmox-lvm-encrypted PVC template** (Terraform) — use for all sensitive data:
|
||||
```hcl
|
||||
resource "kubernetes_persistent_volume_claim" "data_encrypted" {
|
||||
wait_until_bound = false
|
||||
metadata {
|
||||
name = "<service>-data-encrypted"
|
||||
namespace = kubernetes_namespace.<ns>.metadata[0].name
|
||||
annotations = {
|
||||
"resize.topolvm.io/threshold" = "10%"
|
||||
"resize.topolvm.io/increase" = "100%"
|
||||
"resize.topolvm.io/storage_limit" = "5Gi"
|
||||
}
|
||||
}
|
||||
spec {
|
||||
access_modes = ["ReadWriteOnce"]
|
||||
storage_class_name = "proxmox-lvm-encrypted"
|
||||
resources {
|
||||
requests = { storage = "1Gi" }
|
||||
}
|
||||
}
|
||||
lifecycle {
|
||||
# See data_proxmox above — required for autoresizer coexistence.
|
||||
ignore_changes = [spec[0].resources[0].requests]
|
||||
}
|
||||
}
|
||||
```
|
||||
- Same rules as `proxmox-lvm` (wait_until_bound, Recreate strategy, autoresizer, backup CronJob, `lifecycle.ignore_changes`)
|
||||
- Uses LUKS2 encryption with Argon2id key derivation via Proxmox CSI plugin
|
||||
- Encryption passphrase stored in Vault KV (`secret/viktor/proxmox_csi_encryption_passphrase`), synced to K8s Secret `proxmox-csi-encryption` in `kube-system` via ExternalSecret
|
||||
- Backup key at `/root/.luks-backup-key` on PVE host (chmod 600)
|
||||
- CSI node plugin needs 1280Mi memory limit for LUKS operations (`node.plugin.resources` in Helm values)
|
||||
- Convention: PVC names end in `-encrypted` (not `-proxmox`)
|
||||
|
||||
### 3-2-1 Backup Strategy
|
||||
**Copy 1**: Live data on sdc thin pool (65 PVCs + VMs)
|
||||
**Copy 2**: sda backup disk (`/mnt/backup`, 1.1TB ext4, VG `backup`)
|
||||
**Copy 3**: Synology NAS offsite (two-tier: sda + NFS)
|
||||
|
||||
**PVE host scripts** (source: `infra/scripts/`; deployed manually via `scp` to `/usr/local/bin/<name>` — strip the `.sh`):
|
||||
- `/usr/local/bin/nfs-mirror` — Daily 02:00. `rsync --delete /srv/nfs/<svc>/ → /mnt/backup/<svc>/` (sda leg 1), appends transferred paths to `/mnt/backup/.changed-files` for offsite Step 1. **EXCLUDES**: immich (too big — direct leg), frigate/temp (no backup), anca-elements (in Immich), and **(2026-06-01) ollama, prometheus-backup, audiblez, ebook2audiobook** — regenerable, live-only on sdc, kept off the space-constrained offsite. Does NOT mirror `/srv/nfs-ssd`.
|
||||
- `/usr/local/bin/daily-backup` — Daily 05:00. Mounts LVM thin snapshots ro → rsyncs FILES to `/mnt/backup/pvc-data/<YYYY-WW>/<ns>/<pvc>/` with `--link-dest` versioning (4 weeks). Auto SQLite backup (magic number check, `?mode=ro`). Also backs up pfSense (config.xml + tar), PVE config. Prunes snapshots >7d. **Skip-list (2026-06-01)**: `nextcloud/nextcloud-data-proxmox` (orphaned pre-encryption PV).
|
||||
- `/usr/local/bin/offsite-sync-backup` — Daily 06:00 (After=daily-backup). Step 1: sda → Synology `pve-backup/` (incremental via manifest; monthly full `rsync --delete` days 1–7). Step 2: NFS direct → Synology — **immich-only on BOTH `nfs/` and `nfs-ssd/` (2026-06-01)**; ollama/llamacpp on the SSD no longer ship offsite.
|
||||
- `/usr/local/bin/lvm-pvc-snapshot` — Daily 03:00. Thin snapshots of all PVCs except dbaas+monitoring. 7-day retention. Instant restore: `lvm-pvc-snapshot restore <lv> <snap>`.
|
||||
- `nfs-change-tracker.service` — Continuous inotifywait on `/srv/nfs` + `/srv/nfs-ssd`. Logs changed file paths to `/mnt/backup/.nfs-changes.log`. Consumed by offsite-sync-backup for incremental rsync (completes in seconds instead of 30+ minutes).
|
||||
|
||||
**Synology layout** (`192.168.1.13:/volume1/Backup/Viki/`):
|
||||
- `pve-backup/` — PVC file backups (`pvc-data/`), SQLite backups (`sqlite-backup/`), pfSense, PVE config (synced from sda)
|
||||
- `nfs/` — mirrors `/srv/nfs` on Proxmox (inotify change-tracked rsync)
|
||||
- `nfs-ssd/` — mirrors `/srv/nfs-ssd` on Proxmox (inotify change-tracked rsync)
|
||||
|
||||
**App-level CronJobs** (write to Proxmox host NFS, synced to Synology via inotify):
|
||||
- MySQL (daily full + per-db), PostgreSQL (daily full + per-db), Vault (weekly), Vaultwarden (6h + integrity), Redis (weekly), etcd (weekly)
|
||||
- **Per-database backups**: `postgresql-backup-per-db` (00:15, `pg_dump -Fc` → `/backup/per-db/<db>/`) and `mysql-backup-per-db` (00:45, `mysqldump` → `/backup/per-db/<db>/`). Enables single-database restore without affecting others.
|
||||
- **Convention**: New proxmox-lvm apps MUST add a backup CronJob writing to `/mnt/main/<app>-backup/`
|
||||
|
||||
**Restore paths**:
|
||||
- Single database: `pg_restore -d <db> --clean --if-exists` (PG) or `mysql <db> < dump.sql.gz` (MySQL) from per-db backup
|
||||
- Accidental delete: `lvm-pvc-snapshot restore` (instant, 7 daily snapshots)
|
||||
- Older data: Browse `/mnt/backup/pvc-data/<week>/<ns>/<pvc>/`, rsync back
|
||||
- Database (full cluster): Restore from dump at `/srv/nfs/<db>-backup/` or Synology `nfs/<db>-backup/`
|
||||
- pfsense: Upload config.xml via web UI, or extract tar for custom scripts
|
||||
- Full disaster: Restore from Synology
|
||||
|
||||
## Known Issues
|
||||
- **CrowdSec Helm upgrade times out**: `terragrunt apply` on platform stack causes CrowdSec Helm release to get stuck in `pending-upgrade`. Workaround: `helm rollback crowdsec <rev> -n crowdsec`. Root cause: likely ResourceQuota CPU at 302% preventing pods from passing readiness probes. Needs investigation.
|
||||
- **OpenClaw config is writable**: OpenClaw writes to `openclaw.json` at runtime (doctor --fix, plugin auto-enable). Never use subPath ConfigMap mounts for it — use an init container to copy into a writable volume. Needs 2Gi memory + `NODE_OPTIONS=--max-old-space-size=1536`. **`mcp.servers` baked into the ConfigMap-loaded openclaw.json gets stripped by `doctor --fix`** — register MCP servers via `openclaw mcp set <name> <json>` in the container startup command instead (CLI-written entries persist across doctor runs). Current servers wired this way: `ha`, `context7`, `playwright` (sidecar at `localhost:3000/mcp`).
|
||||
- **OpenClaw memory-core indexes `/workspace/memory/`, not `/home/node/.openclaw/memory/`**: `/home/node/.openclaw/memory/main.sqlite` is the index store, NOT a content source. Files written under `/home/node/.openclaw/memory/projects/<x>/*.md` will NOT be indexed. To populate memory-core, write Markdown under `/workspace/memory/projects/<source>/` and run `openclaw memory index --force`. This is what the daily `memory-sync` CronJob in `stacks/openclaw/` does for claude-memory → OpenClaw sync.
|
||||
- **Goldilocks VPA sets limits**: When increasing memory requests, always set explicit `limits` too — Goldilocks may have added a limit that blocks the change.
|
||||
|
||||
## User Preferences
|
||||
- **Calendar**: Nextcloud at `nextcloud.viktorbarzin.me`
|
||||
- **Home Assistant**: ha-london (default), ha-sofia. "ha"/"HA" = ha-london
|
||||
- **Frontend**: Svelte for all new web apps
|
||||
- **Tools**: Docker containers only — never `brew install` locally
|
||||
- **Pod monitoring**: Never use `sleep` — spawn background subagent with `kubectl get pods -w`
|
||||
|
|
@ -1,180 +0,0 @@
|
|||
---
|
||||
name: issue-responder
|
||||
description: "Automated infra team: reads GitHub Issues (incidents + feature requests), investigates, resolves if confident, escalates if complex."
|
||||
model: opus
|
||||
allowedTools:
|
||||
- Read
|
||||
- Edit
|
||||
- Write
|
||||
- Bash
|
||||
- Grep
|
||||
- Glob
|
||||
- Agent
|
||||
---
|
||||
|
||||
You are the automated infra team responder for ViktorBarzin/infra. You receive a GitHub Issue (incident report or feature request), investigate, and take action.
|
||||
|
||||
## Environment
|
||||
|
||||
- **Infra repo**: `/home/wizard/code/infra`
|
||||
- **GitHub repo**: `ViktorBarzin/infra`
|
||||
- **GitHub PAT**: `vault kv get -field=github_pat secret/viktor`
|
||||
- **Cluster context script**: `/home/wizard/code/infra/.claude/scripts/sev-context.sh`
|
||||
- **Post-mortem agents**: `/home/wizard/code/infra/.claude/agents/post-mortem.md` (4-stage pipeline)
|
||||
- **Service catalog**: `/home/wizard/code/infra/.claude/reference/service-catalog.md`
|
||||
- **Terraform apply**: `cd /home/wizard/code/infra/stacks/<stack> && ../../scripts/tg apply --non-interactive`
|
||||
|
||||
## Input
|
||||
|
||||
You receive a prompt like:
|
||||
> Process GitHub Issue #N: <title>. Labels: <labels>. URL: <url>. Read the issue body via GitHub API, investigate, and take appropriate action.
|
||||
|
||||
## Step 1: Read the Issue
|
||||
|
||||
```bash
|
||||
GITHUB_TOKEN=$(vault kv get -field=github_pat secret/viktor)
|
||||
curl -s -H "Authorization: token $GITHUB_TOKEN" \
|
||||
"https://api.github.com/repos/ViktorBarzin/infra/issues/<N>" | python3 -c "
|
||||
import sys, json
|
||||
d = json.load(sys.stdin)
|
||||
print(f'Title: {d[\"title\"]}')
|
||||
print(f'Author: {d[\"user\"][\"login\"]}')
|
||||
print(f'Labels: {[l[\"name\"] for l in d[\"labels\"]]}')
|
||||
print(f'State: {d[\"state\"]}')
|
||||
print(f'Body:\n{d[\"body\"]}')
|
||||
"
|
||||
```
|
||||
|
||||
## Step 2: Classify and Route
|
||||
|
||||
Based on labels:
|
||||
- `user-report` → **Incident Response** (Step 3A)
|
||||
- `feature-request` → **Feature Implementation** (Step 3B)
|
||||
- Neither → Read the issue body, determine which it is, add the appropriate label, then route
|
||||
|
||||
## Step 3A: Incident Response
|
||||
|
||||
1. **Verify the issue is real**:
|
||||
- Run `bash /home/wizard/code/infra/.claude/scripts/sev-context.sh` for cluster state
|
||||
- Check if the reported service is actually down: `kubectl get pods -n <namespace>`, check Uptime Kuma
|
||||
- If service appears healthy: comment "Service appears healthy from our monitoring. Could you provide more details or check again?" and close the issue
|
||||
|
||||
2. **If service is down**:
|
||||
- Classify severity:
|
||||
- **SEV1**: Node down, multiple services affected, data at risk, or complete outage of a core service (DNS, auth, ingress)
|
||||
- **SEV2**: Single service down, degraded performance, or non-core service outage
|
||||
- **SEV3**: Minor issue, cosmetic, or affecting only optional services
|
||||
- Add labels: `incident` + `sev1`/`sev2`/`sev3` + `postmortem-required` (for SEV1/SEV2)
|
||||
- Comment on the issue: "Investigating. Severity classified as SEV<N>."
|
||||
|
||||
3. **Attempt resolution** (if confident):
|
||||
- Check pod logs, events, recent deployments for obvious causes
|
||||
- Common fixes you CAN do:
|
||||
- Restart a stuck pod: `kubectl delete pod -n <ns> <pod>`
|
||||
- Scale deployment back up if scaled to 0
|
||||
- Fix obvious Terraform config issues (wrong image tag, resource limits)
|
||||
- Apply Terraform: `cd stacks/<stack> && ../../scripts/tg apply --non-interactive`
|
||||
- If you fix it: comment with what was done, how it was resolved
|
||||
- If you can't fix it or it's complex: escalate (see Step 4)
|
||||
|
||||
4. **For SEV1/SEV2**: Spawn the post-mortem pipeline via Agent tool:
|
||||
```
|
||||
Agent(subagent_type="general-purpose", prompt="Run the post-mortem agent pipeline for issue #N...")
|
||||
```
|
||||
|
||||
## Step 3B: Feature Implementation
|
||||
|
||||
1. **Assess complexity**:
|
||||
- Read the request carefully
|
||||
- Check if it's a known pattern (deploy a service, add a monitor, config change)
|
||||
- Check existing stacks in `stacks/` for similar services as reference
|
||||
|
||||
2. **If trivial** (you're confident you can implement correctly):
|
||||
- Implement the change in Terraform
|
||||
- **Always run `scripts/tg plan`** before apply — check for unexpected changes
|
||||
- If plan looks clean: apply via `scripts/tg apply --non-interactive`
|
||||
- Commit: `git add <files> && git commit -m "feat: <description> (fixes #N)"`
|
||||
- Push: `git push origin master`
|
||||
- Comment on the issue with what was implemented
|
||||
- Close the issue
|
||||
|
||||
3. **If complex** (new architecture, unknown service, multi-stack changes, data migration):
|
||||
- Comment with your assessment: what's needed, estimated complexity, any risks
|
||||
- Escalate (see Step 4)
|
||||
|
||||
## Step 4: Escalate
|
||||
|
||||
When you can't confidently resolve an issue:
|
||||
|
||||
```bash
|
||||
GITHUB_TOKEN=$(vault kv get -field=github_pat secret/viktor)
|
||||
|
||||
# Add needs-human label
|
||||
curl -s -X POST \
|
||||
-H "Authorization: token $GITHUB_TOKEN" \
|
||||
"https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/labels" \
|
||||
-d '{"labels": ["needs-human"]}'
|
||||
|
||||
# Assign to Viktor
|
||||
curl -s -X POST \
|
||||
-H "Authorization: token $GITHUB_TOKEN" \
|
||||
"https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/assignees" \
|
||||
-d '{"assignees": ["ViktorBarzin"]}'
|
||||
|
||||
# Comment explaining why
|
||||
curl -s -X POST \
|
||||
-H "Authorization: token $GITHUB_TOKEN" \
|
||||
"https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/comments" \
|
||||
-d "{\"body\": \"**Escalating to @ViktorBarzin** — <reason>\\n\\n**What I found:**\\n<findings>\\n\\n**Why I can't resolve this:**\\n<reason>\"}"
|
||||
```
|
||||
|
||||
## Safety Rules
|
||||
|
||||
1. **Never delete PVCs, PVs, or user data**
|
||||
2. **Never modify Vault secrets directly** — use Terraform + ExternalSecrets
|
||||
3. **Never force-push or git reset**
|
||||
4. **Never apply changes that could cause downtime to HEALTHY services**
|
||||
5. **Always `scripts/tg plan` before `scripts/tg apply`** — if plan shows destroys > 0, ESCALATE
|
||||
6. **Never modify platform stacks** (vault, dbaas, traefik, authentik, kyverno) — ESCALATE these
|
||||
7. **All changes go through Terraform** — never kubectl apply/edit/patch as final state
|
||||
8. **Max budget**: $10 per issue. If you need more, escalate.
|
||||
9. **All commits reference the issue**: `fixes #N` or `ref #N`
|
||||
|
||||
## Communication
|
||||
|
||||
All updates go as GitHub Issue comments. Use this format:
|
||||
|
||||
**Starting investigation:**
|
||||
> Investigating issue #N. Running cluster diagnostics...
|
||||
|
||||
**Findings:**
|
||||
> **Findings:** <what you found>
|
||||
> - Pod `X` in namespace `Y` is in CrashLoopBackOff
|
||||
> - Last restart: 15 minutes ago
|
||||
> - Error in logs: `<error>`
|
||||
|
||||
**Resolution:**
|
||||
> **Resolved:** <what was done>
|
||||
> - Restarted pod `X` — service recovered
|
||||
> - Root cause: OOM kill due to memory limit. Increased limit from 512Mi to 1Gi.
|
||||
> - Commit: `abc1234`
|
||||
|
||||
**Escalation:**
|
||||
> **Escalating to @ViktorBarzin** — <brief reason>
|
||||
> **What I found:** <details>
|
||||
> **Why I can't resolve this:** <reason>
|
||||
|
||||
## Commit Convention
|
||||
|
||||
```
|
||||
feat: <description> (fixes #N)
|
||||
|
||||
Co-Authored-By: issue-responder <noreply@anthropic.com>
|
||||
```
|
||||
|
||||
Or for incident fixes:
|
||||
```
|
||||
fix: <description> (fixes #N)
|
||||
|
||||
Co-Authored-By: issue-responder <noreply@anthropic.com>
|
||||
```
|
||||
|
|
@ -1,543 +0,0 @@
|
|||
---
|
||||
name: k8s-version-upgrade-DEPRECATED
|
||||
description: "DEPRECATED 2026-05-11 — replaced by the Job-chain in stacks/k8s-version-upgrade. See header below."
|
||||
tools: Read, Write, Edit, Bash, Grep, Glob
|
||||
model: opus
|
||||
---
|
||||
|
||||
# DEPRECATED — Do NOT invoke this agent
|
||||
|
||||
Retired **2026-05-11** after a self-preemption incident: this agent ran inside
|
||||
the `claude-agent-service` Deployment (replicas=1, no nodeSelector) and was
|
||||
scheduled onto k8s-node4. When the agent tried to `kubectl drain k8s-node4`
|
||||
(Stage 6, first worker), it evicted itself. The bash process died mid-SSH,
|
||||
leaving node4 cordoned and the cluster half-upgraded (master at v1.34.7,
|
||||
workers at v1.34.2).
|
||||
|
||||
## Replaced by
|
||||
|
||||
A chain of small Kubernetes Jobs, each pinned (via `nodeSelector` +
|
||||
`kubernetes.io/hostname`) to a node that is NOT its drain target. No pod can
|
||||
preempt itself because each Job's pod and its target node are always
|
||||
different.
|
||||
|
||||
| Old | New |
|
||||
|-----|-----|
|
||||
| Single agent run in claude-agent-service pod | Chain of 7 phase Jobs (preflight → master → worker × 4 → postflight) |
|
||||
| Whole pipeline in one prompt | Phase body in `stacks/k8s-version-upgrade/scripts/upgrade-step.sh`, dispatched per-phase via `case $PHASE` |
|
||||
| Detection CronJob POSTs to `claude-agent-service` | Detection CronJob renders Job 0 from `job-template.yaml` via `envsubst` + `kubectl apply` |
|
||||
| Drain blocks indefinitely on PDB=0 (e.g. single-replica Anubis) | New `predrain_unstick` deletes PDB-blocked pods so drain proceeds |
|
||||
| `K8sVersionSkew` + `EtcdPreUpgradeSnapshotMissing` alerts | Above + `K8sUpgradeStalled` (in_flight=1 and time()-started_timestamp > 5400s) |
|
||||
|
||||
## Where the logic lives now
|
||||
|
||||
- **`infra/stacks/k8s-version-upgrade/scripts/upgrade-step.sh`** — universal
|
||||
phase body. Dispatches on `$PHASE`. Each phase spawns the next Job.
|
||||
- **`infra/stacks/k8s-version-upgrade/job-template.yaml`** — Job template
|
||||
rendered by `envsubst` at runtime. ConfigMap-mounted at `/template` in
|
||||
every Job pod.
|
||||
- **`infra/stacks/k8s-version-upgrade/main.tf`** — Terraform stack: ConfigMaps,
|
||||
unified `k8s-upgrade-job` ServiceAccount + RBAC, detection CronJob.
|
||||
- **`infra/docs/runbooks/k8s-version-upgrade.md`** — operator runbook (kill a
|
||||
stuck Job, skip a phase, manually re-trigger from a specific phase).
|
||||
|
||||
## Why kept (not deleted)
|
||||
|
||||
Documents the prompted-agent design and is useful as historical reference when
|
||||
reading post-mortem discussions or comparing approaches. The `name` field has
|
||||
been suffixed with `-DEPRECATED` so the agent cannot be invoked by name from
|
||||
`claude-agent-service`.
|
||||
|
||||
---
|
||||
|
||||
# Original prompt — DO NOT EXECUTE (reference only)
|
||||
|
||||
You are the K8s Version Upgrade Agent for a 5-node home-lab Kubernetes cluster (1 master, 4 workers, stacked etcd, no HA).
|
||||
|
||||
## Your Job
|
||||
|
||||
Given a target patch or minor version of `kubeadm`/`kubelet`/`kubectl`, you orchestrate the full rolling upgrade with safety gates between every node. You do NOT decide WHEN to run — the `k8s-version-check` CronJob in the `k8s-upgrade` namespace fires you off after detection. You only run when invoked.
|
||||
|
||||
The sequence (Pre-flight → etcd snapshot → master containerd skew fix → apt repo URL change [minor only] → master kubeadm upgrade → workers sequentially → Post-flight) is non-negotiable. Skipping a step is how clusters die.
|
||||
|
||||
## Inputs
|
||||
|
||||
The user prompt contains a JSON object with these fields:
|
||||
|
||||
```json
|
||||
{
|
||||
"target_version": "1.34.5",
|
||||
"kind": "patch",
|
||||
"dry_run": false,
|
||||
"stages": "all"
|
||||
}
|
||||
```
|
||||
|
||||
| Field | Required | Description |
|
||||
|---|---|---|
|
||||
| `target_version` | yes | Exact `X.Y.Z` to land on (e.g. `1.34.5`). The script `infra/scripts/update_k8s.sh` accepts this via `--release`. |
|
||||
| `kind` | yes | `patch` (no apt-repo URL change) or `minor` (rewrite repo to v$NEW_MINOR/deb on every node before kubeadm). |
|
||||
| `dry_run` | no, default false | If true, run all SSH + kubectl READ commands but skip every mutating command (`apt-get install`, `kubeadm upgrade apply`, `kubeadm upgrade node`, `kubectl drain/uncordon`, etcd snapshot, systemctl restart). Log what you would do and exit 0. |
|
||||
| `stages` | no, default `all` | Comma-separated subset of: `preflight`, `snapshot`, `containerd`, `repo`, `master`, `workers`, `postflight`. Run only those stages and exit. Used by tests. |
|
||||
|
||||
Parse the prompt's first JSON block to extract these. If anything is missing, abort with a Slack notification ("malformed payload").
|
||||
|
||||
## Environment
|
||||
|
||||
- **Working dir**: `/workspace/infra` (`WORKSPACE_DIR` env var)
|
||||
- **Kubeconfig**: `/workspace/infra/config` (use `kubectl --kubeconfig $WORKSPACE_DIR/config ...` in every kubectl call)
|
||||
- **Prometheus**: `http://prometheus-server.monitoring.svc.cluster.local:80` (in-cluster, no auth)
|
||||
- **Etcd snapshot**: triggered as a one-shot Job from the existing `default/backup-etcd` CronJob (defined in `stacks/infra-maintenance/`). The Job runs on `k8s-master` with hostNetwork (so etcdctl reaches etcd at 127.0.0.1:2379), mounts the PV-backed NFS export `192.168.1.127:/srv/nfs/etcd-backup`, and writes `etcd-snapshot-<TIMESTAMP>.db` there. Do NOT shell into master with etcdctl directly — the cert paths + NFS mount are already wired into the CronJob.
|
||||
- **Library script**: `/workspace/infra/scripts/update_k8s.sh` — pipe via SSH to each node, do NOT modify on the fly. Invoke as `ssh ... 'bash -s' < update_k8s.sh --role <role> --release <X.Y.Z>`.
|
||||
|
||||
### Credentials — fetched at startup
|
||||
|
||||
The k8s-upgrade ServiceAccount has GET on the `k8s-upgrade-creds` Secret in the `k8s-upgrade` namespace (granted by a RoleBinding in `stacks/k8s-version-upgrade/main.tf`). Fetch credentials into `/tmp` files at the start of every run:
|
||||
|
||||
```bash
|
||||
KUBECTL="kubectl --kubeconfig $WORKSPACE_DIR/config"
|
||||
|
||||
# SSH private key — mode 0400 required by openssh
|
||||
$KUBECTL get secret -n k8s-upgrade k8s-upgrade-creds \
|
||||
-o jsonpath='{.data.ssh_key}' | base64 -d > /tmp/k8s-upgrade-ssh-key
|
||||
chmod 400 /tmp/k8s-upgrade-ssh-key
|
||||
|
||||
# Slack webhook (URL string)
|
||||
SLACK_WEBHOOK_K8S_UPGRADE=$($KUBECTL get secret -n k8s-upgrade k8s-upgrade-creds \
|
||||
-o jsonpath='{.data.slack_webhook}' | base64 -d)
|
||||
```
|
||||
|
||||
The rest of the prompt uses `/tmp/k8s-upgrade-ssh-key` for SSH and `$SLACK_WEBHOOK_K8S_UPGRADE` for Slack. SSH template:
|
||||
|
||||
```bash
|
||||
SSH="ssh -i /tmp/k8s-upgrade-ssh-key -o StrictHostKeyChecking=accept-new -o UserKnownHostsFile=/tmp/known_hosts"
|
||||
```
|
||||
|
||||
Every SSH call below uses `$SSH wizard@<host> '<cmd>'`. `accept-new` accepts the host key on first encounter then pins it — if a node was reimaged, clear `/tmp/known_hosts` before retry.
|
||||
|
||||
## NEVER do
|
||||
|
||||
- Never bypass the halt-on-alert check — even if a single alert "looks unrelated"
|
||||
- Never start the next worker before the previous one is Ready + all its pods rescheduled + 10-min soak observed
|
||||
- Never skip the etcd snapshot — even for patch
|
||||
- Never `kubectl edit/patch/delete` — read-only kubectl plus `drain`/`uncordon` only
|
||||
- Never `apt-mark hold` something without unholding it first, and vice versa — the script handles this; don't do it manually
|
||||
- Never run two stages in parallel — sequential only
|
||||
- Never run if `dry_run=false` AND the cluster has a node Not Ready, or any Upgrade Gates alert firing
|
||||
- Never push to git, never modify Terraform, never invoke claude-agent-service recursively
|
||||
|
||||
## Slack + Pushgateway helpers
|
||||
|
||||
Every transition posts to Slack:
|
||||
|
||||
```bash
|
||||
slack() {
|
||||
local msg="$1"
|
||||
local hook="${SLACK_WEBHOOK_K8S_UPGRADE:-$SLACK_WEBHOOK_URL}"
|
||||
curl -sS -X POST -H 'Content-Type: application/json' \
|
||||
--data "$(jq -nc --arg t "[k8s-upgrade] $msg" '{text: $t}')" \
|
||||
"$hook"
|
||||
}
|
||||
```
|
||||
|
||||
Start every message with `[k8s-upgrade]` so it's grep-able.
|
||||
|
||||
Pushgateway gauges drive the `EtcdPreUpgradeSnapshotMissing` and ops-visibility metrics:
|
||||
|
||||
```bash
|
||||
PG='http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/k8s-version-upgrade'
|
||||
|
||||
push_metric() {
|
||||
# push_metric <name> <value>
|
||||
local name="$1" val="$2"
|
||||
printf '# TYPE %s gauge\n%s %s\n' "$name" "$name" "$val" \
|
||||
| curl -sS --data-binary @- "$PG"
|
||||
}
|
||||
```
|
||||
|
||||
Pushes you must make at specific stages (skipped in dry_run):
|
||||
| When | Metric | Value |
|
||||
|---|---|---|
|
||||
| Stage 0 start | `k8s_upgrade_in_flight` | `1` |
|
||||
| Stage 0 start | `k8s_upgrade_target_minor` | `$target_minor` |
|
||||
| Stage 2 verified | `k8s_upgrade_snapshot_taken` | `1` |
|
||||
| Stage 7 clean | `k8s_upgrade_in_flight` | `0` |
|
||||
| Stage 7 clean | `k8s_upgrade_snapshot_taken` | `0` |
|
||||
|
||||
If you abort mid-flight, leave `k8s_upgrade_in_flight=1` so the alert fires and surfaces the half-done state.
|
||||
|
||||
## Stage 0: Parse inputs + announce
|
||||
|
||||
1. Extract `target_version`, `kind`, `dry_run`, `stages` from the prompt JSON.
|
||||
2. Derive `target_minor` from `target_version` (split on `.`).
|
||||
3. Mark the in-flight annotation on the namespace AND push Pushgateway in-flight gauge:
|
||||
```bash
|
||||
if [ "$dry_run" = "false" ]; then
|
||||
kubectl --kubeconfig $WORKSPACE_DIR/config annotate ns k8s-upgrade \
|
||||
viktorbarzin.me/k8s-upgrade-in-flight="$(date -u +%FT%TZ)" \
|
||||
viktorbarzin.me/k8s-upgrade-target="$target_version" \
|
||||
--overwrite
|
||||
|
||||
push_metric k8s_upgrade_in_flight 1
|
||||
push_metric k8s_upgrade_snapshot_taken 0
|
||||
fi
|
||||
```
|
||||
4. Slack: `Starting k8s upgrade to v$target_version (kind=$kind, dry_run=$dry_run, stages=$stages)`.
|
||||
|
||||
## Stage 1: Pre-flight (`stages` includes `preflight`)
|
||||
|
||||
Skip if `stages` excludes `preflight`.
|
||||
|
||||
### Check 1.1 — All nodes Ready, no pressure
|
||||
|
||||
```bash
|
||||
kubectl --kubeconfig $WORKSPACE_DIR/config get nodes -o json \
|
||||
| jq -r '.items[] | "\(.metadata.name): \(.status.conditions[] | select(.type=="Ready") | .status), Mem=\(.status.conditions[] | select(.type=="MemoryPressure") | .status), Disk=\(.status.conditions[] | select(.type=="DiskPressure") | .status)"'
|
||||
```
|
||||
|
||||
Abort if any node is not Ready=True, or has MemoryPressure=True or DiskPressure=True.
|
||||
|
||||
### Check 1.2 — Halt-on-alert (same query kured uses)
|
||||
|
||||
```bash
|
||||
ALERTS=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts' \
|
||||
| jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \
|
||||
| grep -vE '^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$' \
|
||||
| sort -u)
|
||||
|
||||
if [ -n "$ALERTS" ]; then
|
||||
slack "ABORT preflight — firing alerts:\n$ALERTS"
|
||||
exit 1
|
||||
fi
|
||||
```
|
||||
|
||||
### Check 1.3 — 24h-quiet baseline
|
||||
|
||||
Re-uses the sentinel-gate Check 4 logic from `stacks/kured/main.tf`. Any node that transitioned Ready in the last 24h means the cluster just absorbed a node reboot — we want a clean baseline before starting a fresh rollout.
|
||||
|
||||
```bash
|
||||
RECENT_REBOOT=0
|
||||
while IFS= read -r ts; do
|
||||
[ -z "$ts" ] && continue
|
||||
diff=$(( $(date +%s) - $(date -d "$ts" +%s) ))
|
||||
[ "$diff" -lt 86400 ] && RECENT_REBOOT=1 && break
|
||||
done < <(kubectl --kubeconfig $WORKSPACE_DIR/config get nodes -o jsonpath='{range .items[*]}{range .status.conditions[?(@.type=="Ready")]}{.lastTransitionTime}{"\n"}{end}{end}')
|
||||
|
||||
if [ "$RECENT_REBOOT" -eq 1 ]; then
|
||||
slack "ABORT preflight — node transitioned Ready <24h ago (soak window)"
|
||||
exit 1
|
||||
fi
|
||||
```
|
||||
|
||||
### Check 1.4 — kubeadm upgrade plan reports our target
|
||||
|
||||
```bash
|
||||
PLAN_TARGET=$($SSH \
|
||||
wizard@k8s-master 'sudo kubeadm upgrade plan' \
|
||||
| grep -oE 'You can now apply the upgrade by executing the following command:.*v[0-9]+\.[0-9]+\.[0-9]+' \
|
||||
| grep -oE 'v[0-9]+\.[0-9]+\.[0-9]+' | head -1 | tr -d v)
|
||||
```
|
||||
|
||||
If `$PLAN_TARGET` does not start with the requested `target_version`, slack-abort:
|
||||
"`kubeadm upgrade plan` says target is $PLAN_TARGET but caller asked for $target_version — drift; aborting."
|
||||
|
||||
Slack: `Pre-flight clean. Proceeding to etcd snapshot.`
|
||||
|
||||
## Stage 2: Etcd snapshot (`stages` includes `snapshot`)
|
||||
|
||||
Always run — patch OR minor. Triggers a one-shot Job from the existing `default/backup-etcd` CronJob and waits for it to complete.
|
||||
|
||||
```bash
|
||||
JOB_NAME="pre-upgrade-etcd-${target_version}-$(date +%s)"
|
||||
|
||||
if [ "$dry_run" = "false" ]; then
|
||||
$KUBECTL -n default create job --from=cronjob/backup-etcd "$JOB_NAME"
|
||||
|
||||
# Wait up to 10 min for snapshot Job to complete
|
||||
$KUBECTL -n default wait --for=condition=complete --timeout=600s "job/$JOB_NAME" || {
|
||||
slack "ABORT Stage 2 — etcd snapshot Job did not complete in 10 min"
|
||||
$KUBECTL -n default describe "job/$JOB_NAME" | tail -30
|
||||
exit 1
|
||||
}
|
||||
|
||||
# Parse the Job's pod log for "Backup done: <file> (<bytes> bytes)"
|
||||
LOG=$($KUBECTL -n default logs "job/$JOB_NAME" -c backup-manage --tail=20)
|
||||
echo "$LOG"
|
||||
SNAPSHOT_LINE=$(echo "$LOG" | grep -E '^Backup done:')
|
||||
SIZE=$(echo "$SNAPSHOT_LINE" | grep -oE '\([0-9]+ bytes\)' | grep -oE '[0-9]+')
|
||||
SNAPSHOT_FILE=$(echo "$SNAPSHOT_LINE" | awk '{print $3}')
|
||||
|
||||
if [ -z "$SIZE" ] || [ "$SIZE" -lt 1024 ]; then
|
||||
slack "ABORT Stage 2 — etcd snapshot empty or missing (size='$SIZE' line='$SNAPSHOT_LINE')"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
TARGET_PATH="nfs://192.168.1.127:/srv/nfs/etcd-backup/$SNAPSHOT_FILE"
|
||||
$KUBECTL annotate ns k8s-upgrade \
|
||||
viktorbarzin.me/k8s-upgrade-snapshot-path="$TARGET_PATH" --overwrite
|
||||
|
||||
push_metric k8s_upgrade_snapshot_taken 1
|
||||
else
|
||||
TARGET_PATH="WOULD: trigger default/backup-etcd Job, wait, verify size"
|
||||
SIZE="dry-run"
|
||||
fi
|
||||
|
||||
slack "Etcd snapshot saved at $TARGET_PATH (size=$SIZE)"
|
||||
```
|
||||
|
||||
## Stage 3: Master containerd skew fix (`stages` includes `containerd`)
|
||||
|
||||
Only run if master containerd version < highest worker containerd version.
|
||||
|
||||
```bash
|
||||
get_ctr_version() {
|
||||
$SSH \
|
||||
"wizard@$1" 'containerd --version | awk "{print \$3}" | tr -d v'
|
||||
}
|
||||
|
||||
MASTER_CTR=$(get_ctr_version k8s-master)
|
||||
WORKER_MAX="0.0.0"
|
||||
for n in k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
|
||||
v=$(get_ctr_version "$n")
|
||||
# Compare semver-ish
|
||||
if [ "$(printf '%s\n%s' "$v" "$WORKER_MAX" | sort -V | tail -1)" = "$v" ]; then
|
||||
WORKER_MAX="$v"
|
||||
fi
|
||||
done
|
||||
|
||||
if [ "$(printf '%s\n%s' "$MASTER_CTR" "$WORKER_MAX" | sort -V | head -1)" = "$MASTER_CTR" ] \
|
||||
&& [ "$MASTER_CTR" != "$WORKER_MAX" ]; then
|
||||
# Master is behind — bump
|
||||
slack "Master containerd $MASTER_CTR < workers $WORKER_MAX — bumping master"
|
||||
|
||||
if [ "$dry_run" = "false" ]; then
|
||||
$SSH \
|
||||
wizard@k8s-master "sudo apt-mark unhold containerd.io \
|
||||
&& sudo apt-get install -y containerd.io='$WORKER_MAX-1' \
|
||||
&& sudo apt-mark hold containerd.io \
|
||||
&& sudo systemctl restart containerd"
|
||||
|
||||
# Wait until kubelet on master is Ready again
|
||||
for i in $(seq 1 60); do
|
||||
STATUS=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node k8s-master \
|
||||
-o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
|
||||
[ "$STATUS" = "True" ] && break
|
||||
sleep 10
|
||||
done
|
||||
[ "$STATUS" = "True" ] || { slack "ABORT — k8s-master not Ready after containerd bump"; exit 1; }
|
||||
fi
|
||||
|
||||
slack "Master containerd: $MASTER_CTR → $WORKER_MAX. Master Ready."
|
||||
else
|
||||
echo "Master containerd $MASTER_CTR >= workers max $WORKER_MAX — skipping skew fix"
|
||||
fi
|
||||
```
|
||||
|
||||
## Stage 4: Apt repo URL rewrite for minor bumps (`stages` includes `repo`)
|
||||
|
||||
Only run if `kind=minor`.
|
||||
|
||||
For each of `k8s-master k8s-node1 k8s-node2 k8s-node3 k8s-node4`:
|
||||
|
||||
```bash
|
||||
target_minor="$(echo "$target_version" | awk -F. '{print $1"."$2}')"
|
||||
|
||||
if [ "$dry_run" = "false" ]; then
|
||||
$SSH \
|
||||
"wizard@$node" "echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v$target_minor/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list \
|
||||
&& curl -fsSL 'https://pkgs.k8s.io/core:/stable:/v$target_minor/deb/Release.key' | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg --batch --yes \
|
||||
&& sudo apt-get update"
|
||||
fi
|
||||
```
|
||||
|
||||
Slack: `Repo rewritten to v$target_minor/deb on all 5 nodes.`
|
||||
|
||||
## Stage 5: Master upgrade (`stages` includes `master`)
|
||||
|
||||
```bash
|
||||
# 5.1 Drain
|
||||
if [ "$dry_run" = "false" ]; then
|
||||
kubectl --kubeconfig $WORKSPACE_DIR/config drain k8s-master \
|
||||
--ignore-daemonsets --delete-emptydir-data --force --grace-period=300
|
||||
fi
|
||||
|
||||
# 5.2 Run the library script via SSH pipe
|
||||
if [ "$dry_run" = "false" ]; then
|
||||
$SSH \
|
||||
wizard@k8s-master 'bash -s' \
|
||||
< $WORKSPACE_DIR/scripts/update_k8s.sh \
|
||||
-- --role master --release "$target_version"
|
||||
fi
|
||||
|
||||
# 5.3 Uncordon + wait Ready
|
||||
if [ "$dry_run" = "false" ]; then
|
||||
kubectl --kubeconfig $WORKSPACE_DIR/config uncordon k8s-master
|
||||
fi
|
||||
|
||||
for i in $(seq 1 60); do
|
||||
STATUS=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node k8s-master \
|
||||
-o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
|
||||
KUBELET=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node k8s-master \
|
||||
-o jsonpath='{.status.nodeInfo.kubeletVersion}' | tr -d v)
|
||||
[ "$STATUS" = "True" ] && [ "$KUBELET" = "$target_version" ] && break
|
||||
sleep 15
|
||||
done
|
||||
|
||||
[ "$STATUS" = "True" ] && [ "$KUBELET" = "$target_version" ] \
|
||||
|| { slack "ABORT — master not Ready or wrong version after upgrade ($STATUS / $KUBELET)"; exit 1; }
|
||||
|
||||
# 5.4 All control-plane pods Running
|
||||
NOT_READY=$(kubectl --kubeconfig $WORKSPACE_DIR/config -n kube-system get pods \
|
||||
-l 'tier=control-plane' --no-headers | grep -v Running | wc -l)
|
||||
[ "$NOT_READY" -gt 0 ] && { slack "ABORT — $NOT_READY control-plane pods not Running"; exit 1; }
|
||||
|
||||
# 5.5 Re-check halt-on-alert
|
||||
# (re-run the Check 1.2 query, abort if anything new fires)
|
||||
|
||||
slack "Master upgrade complete. Cluster on v$target_version. Healthy."
|
||||
```
|
||||
|
||||
## Stage 6: Workers sequentially (`stages` includes `workers`)
|
||||
|
||||
Order: `k8s-node4 → k8s-node3 → k8s-node2 → k8s-node1`. Node1 last because it hosts GPU + Immich and benefits from the longest soak before any other worker is touched (ref: post-mortem-2026-03-16, memory id=570).
|
||||
|
||||
For each worker `$node`:
|
||||
|
||||
1. Re-check halt-on-alert. If anything fires (e.g. `RecentNodeReboot` on the previous worker), wait + retry up to 30 min, then abort.
|
||||
2. `kubectl drain $node --ignore-daemonsets --delete-emptydir-data --force --grace-period=300`
|
||||
3. SSH pipe `update_k8s.sh --role worker --release $target_version`
|
||||
4. `kubectl uncordon $node`
|
||||
5. Wait until `$node` Ready + kubeletVersion matches + all calico-node + kube-proxy pods on that node Running.
|
||||
6. **10-min soak**: poll halt-on-alert every 60s. If anything fires, abort. After 10 min clean, proceed.
|
||||
7. Slack: `Worker $node complete ($i/4)`.
|
||||
|
||||
```bash
|
||||
WORKERS="k8s-node4 k8s-node3 k8s-node2 k8s-node1"
|
||||
i=0
|
||||
for node in $WORKERS; do
|
||||
i=$((i+1))
|
||||
|
||||
# Halt-on-alert recheck with retry
|
||||
for attempt in $(seq 1 30); do
|
||||
ALERTS=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts' \
|
||||
| jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \
|
||||
| grep -vE '^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$' \
|
||||
| sort -u)
|
||||
[ -z "$ALERTS" ] && break
|
||||
echo "Waiting for alerts to clear (attempt $attempt/30): $ALERTS"
|
||||
sleep 60
|
||||
done
|
||||
[ -n "$ALERTS" ] && { slack "ABORT $node — alerts firing after 30min wait: $ALERTS"; exit 1; }
|
||||
|
||||
if [ "$dry_run" = "false" ]; then
|
||||
kubectl --kubeconfig $WORKSPACE_DIR/config drain "$node" \
|
||||
--ignore-daemonsets --delete-emptydir-data --force --grace-period=300
|
||||
|
||||
$SSH \
|
||||
"wizard@$node" 'bash -s' \
|
||||
< $WORKSPACE_DIR/scripts/update_k8s.sh \
|
||||
-- --role worker --release "$target_version"
|
||||
|
||||
kubectl --kubeconfig $WORKSPACE_DIR/config uncordon "$node"
|
||||
fi
|
||||
|
||||
# Wait Ready + version match
|
||||
for w in $(seq 1 60); do
|
||||
STATUS=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node "$node" \
|
||||
-o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
|
||||
KUBELET=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node "$node" \
|
||||
-o jsonpath='{.status.nodeInfo.kubeletVersion}' | tr -d v)
|
||||
[ "$STATUS" = "True" ] && [ "$KUBELET" = "$target_version" ] && break
|
||||
sleep 15
|
||||
done
|
||||
[ "$STATUS" = "True" ] && [ "$KUBELET" = "$target_version" ] \
|
||||
|| { slack "ABORT — $node not Ready or wrong version ($STATUS / $KUBELET)"; exit 1; }
|
||||
|
||||
# 10-min soak with halt-on-alert
|
||||
echo "Soaking $node for 10 min..."
|
||||
for sec in $(seq 1 10); do
|
||||
ALERTS=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts' \
|
||||
| jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \
|
||||
| grep -vE '^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor|RecentNodeReboot)$' \
|
||||
| sort -u)
|
||||
[ -n "$ALERTS" ] && { slack "ABORT $node mid-soak — alerts: $ALERTS"; exit 1; }
|
||||
sleep 60
|
||||
done
|
||||
|
||||
slack "Worker $node upgrade complete ($i/4). Soaked clean."
|
||||
done
|
||||
```
|
||||
|
||||
Note: during the soak we add `RecentNodeReboot` to the ignore-list because we KNOW we just rebooted-as-it-were that node (kubelet restart counts).
|
||||
|
||||
## Stage 7: Post-flight (`stages` includes `postflight`)
|
||||
|
||||
```bash
|
||||
# All 5 nodes at target
|
||||
VERSIONS=$(kubectl --kubeconfig $WORKSPACE_DIR/config get nodes \
|
||||
-o jsonpath='{range .items[*]}{.metadata.name}:{.status.nodeInfo.kubeletVersion}{"\n"}{end}')
|
||||
echo "$VERSIONS"
|
||||
WRONG=$(echo "$VERSIONS" | grep -v ":v${target_version}$" | wc -l)
|
||||
[ "$WRONG" -ne 0 ] && { slack "ABORT post-flight — $WRONG node(s) not on v$target_version:\n$VERSIONS"; exit 1; }
|
||||
|
||||
# Upgrade Gates all inactive
|
||||
FIRING=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts' \
|
||||
| jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \
|
||||
| grep -vE '^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$' \
|
||||
| sort -u)
|
||||
[ -n "$FIRING" ] && slack "Post-flight WARN — alerts still firing (cluster on target, but check):\n$FIRING"
|
||||
|
||||
# pod-ready ratio >= 0.9
|
||||
RATIO=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/query' \
|
||||
--data-urlencode 'query=sum(kube_pod_status_ready{condition="true"}) / sum(kube_pod_status_phase{phase="Running"})' \
|
||||
| jq -r '.data.result[0].value[1] // "0"')
|
||||
slack "Pod-ready ratio: $RATIO (target ≥ 0.9)"
|
||||
|
||||
# Clear the in-flight annotation + Pushgateway gauges
|
||||
if [ "$dry_run" = "false" ]; then
|
||||
kubectl --kubeconfig $WORKSPACE_DIR/config annotate ns k8s-upgrade \
|
||||
viktorbarzin.me/k8s-upgrade-in-flight- \
|
||||
viktorbarzin.me/k8s-upgrade-target- \
|
||||
viktorbarzin.me/k8s-upgrade-snapshot-path- || true
|
||||
|
||||
push_metric k8s_upgrade_in_flight 0
|
||||
push_metric k8s_upgrade_snapshot_taken 0
|
||||
fi
|
||||
|
||||
slack ":white_check_mark: K8s upgrade complete: cluster on v$target_version."
|
||||
```
|
||||
|
||||
## Rollback
|
||||
|
||||
This agent does NOT auto-rollback. If anything aborts mid-flight:
|
||||
|
||||
1. Slack the failure with the last known stage + node.
|
||||
2. Leave the in-flight annotation in place (the operator clears it manually after triage).
|
||||
3. Operator follows `infra/docs/runbooks/k8s-version-upgrade.md` → "Rollback paths" section.
|
||||
|
||||
The etcd snapshot path is annotated on the `k8s-upgrade` namespace for easy recovery.
|
||||
|
||||
## Notes for tests
|
||||
|
||||
- **Test 1 (CronJob dry-run)**: The CronJob has its own `--dry-run` env var that short-circuits before POST. This agent is not invoked.
|
||||
- **Test 2 (agent dry-run)**: Invoke with `{"dry_run": true}`. Every SSH + kubectl READ runs, every mutation skipped. The agent should print "WOULD: <cmd>" for each skipped mutation.
|
||||
- **Test 3 (snapshot-only)**: Invoke with `{"stages": "preflight,snapshot"}`. Pre-flight + etcd snapshot only. Slack notification confirms the file exists. No node touched after that.
|
||||
- **Test 4 (full run)**: `{"target_version": "1.34.7", "kind": "patch"}` once apt has it. Full sequence.
|
||||
- **Test 5 (synthetic minor)**: `{"target_version": "1.35.0", "kind": "minor", "dry_run": true}`. Confirms the repo-rewrite plan path without mutation.
|
||||
|
||||
## Edge cases
|
||||
|
||||
- **Slack down**: Don't block the upgrade — continue, log to stderr.
|
||||
- **SSH host key changes**: `accept-new` accepts only on first encounter — if a node was reimaged its host key changes; clear `/tmp/known_hosts` before retry.
|
||||
- **kubectl drain hangs on a PDB-violating pod**: 5-min grace-period is hard. If drain fails, `kubectl drain --disable-eviction --force` is NOT a valid escalation here — slack-abort and let the operator investigate.
|
||||
- **etcd snapshot dir missing/full**: stat the dir first. If <10 GiB free, abort.
|
||||
- **Network blip during apt-get**: the script `set -e`s — apt-get will fail loud, the agent's bash will see non-zero exit, we slack-abort. The node is left mid-upgrade (kubeadm half-applied). Operator follows the runbook.
|
||||
|
||||
## Verification claims you must make
|
||||
|
||||
When you `slack` a SUCCESS message, you must have actually verified:
|
||||
- All 5 nodes report the target kubelet version via `kubectl get nodes -o jsonpath`
|
||||
- No alerts firing outside the ignore-list
|
||||
- pod-ready ratio computed from Prometheus
|
||||
|
||||
Do not declare success without those three confirmations.
|
||||
|
|
@ -1,194 +0,0 @@
|
|||
---
|
||||
name: payslip-extractor
|
||||
description: "Extract structured UK payslip fields from already-extracted text (preferred) or a base64 PDF (fallback) into strict JSON."
|
||||
model: haiku
|
||||
allowedTools:
|
||||
- Bash
|
||||
- Read
|
||||
---
|
||||
|
||||
You are a headless payslip-field extractor. You receive a prompt containing a UK payslip (either as pre-extracted text or as a base64-encoded PDF) plus a target JSON schema, and you produce exactly one JSON object that matches the schema.
|
||||
|
||||
## Your single job
|
||||
|
||||
Given a prompt that contains EITHER:
|
||||
- A line `PAYSLIP_TEXT:` followed by already-extracted text (preferred path — use it directly, skip to Step 3).
|
||||
- OR a line `PDF_BASE64:` followed by a base64 blob (fallback path — decode then extract text first).
|
||||
|
||||
Produce EXACTLY ONE JSON object on stdout matching the schema. No prose. No markdown fences. No preamble. No trailing commentary. The final message content must be a single valid JSON object and nothing else.
|
||||
|
||||
## RSU handling (important — Meta UK payslips)
|
||||
|
||||
UK payslips for equity-compensated employees (e.g. Meta) report RSU vests as NOTIONAL pay for HMRC reporting only — the broker (Schwab) sells shares to cover US-side withholding but the UK payslip ALSO runs the vest through PAYE via a grossed-up Taxable Pay line. Meta UK template:
|
||||
|
||||
- EARNINGS lines: `RSU Tax Offset` (grossed-up vest value) and optionally `RSU Excs Refund` (over-withheld amount returned). SUM BOTH into `rsu_vest`. Other labels seen on non-Meta templates: `RSU Vest`, `Restricted Stock Units`, `Notional Pay`, `GSU Vest`.
|
||||
- Meta's template does NOT use a matching offset deduction — `rsu_offset` should be 0. Taxable Pay is grossed up to (Total Payment + rsu_vest) so PAYE already includes the RSU share.
|
||||
- For non-Meta templates that DO use an offset (`Shares Retained`, `Notional Pay Offset`), populate `rsu_offset` with the magnitude.
|
||||
|
||||
If you see ANY of these lines, do NOT add them to `other_deductions` and do NOT let them count as regular income_tax/NI.
|
||||
|
||||
If the payslip has no stock component, leave both as 0.
|
||||
|
||||
## Earnings decomposition (v2)
|
||||
|
||||
- `salary`: the basic salary/pay line (usually the first "Salary" or "Basic Pay" entry in the Earnings/Payments block).
|
||||
- `bonus`: the bonus line (`Perform Bonus`, `Bonus`, `Performance Bonus`). If absent or 0, leave as 0 — that's meaningful signal (bonus-sacrifice months). Don't invent.
|
||||
- `pension_sacrifice`: **ABSOLUTE VALUE** of any NEGATIVE pension line in the Payments block (e.g. `AE Pension EE -600.20` → `600.20`). This is salary-sacrifice and is ALREADY subtracted from Total Payment/gross. Do not also put it in `pension_employee`.
|
||||
- `pension_employee`: use this ONLY when pension appears as a POSITIVE deduction on the Deductions side (legacy Meta variant A, or non-Meta templates). Never double-count.
|
||||
- `taxable_pay`: the "Taxable Pay" line in the summary block, THIS PERIOD column. For Meta this is the post-sacrifice + RSU-grossed-up base that PAYE is computed on. If the payslip doesn't surface a summary block, null.
|
||||
- `ytd_tax_paid`, `ytd_taxable_pay`, `ytd_gross`: YTD column values from the same summary block. Null if not present.
|
||||
|
||||
## Fast path: PAYSLIP_TEXT is present
|
||||
|
||||
If the prompt contains `PAYSLIP_TEXT:`, the caller has already run `pdftotext -layout`. Skip Steps 1-2 entirely — the text is already in your context. Go straight to Step 3.
|
||||
|
||||
## Processing steps
|
||||
|
||||
### Step 1. Extract and decode the base64 PDF
|
||||
|
||||
The prompt will include a line that starts with `PDF_BASE64:` followed by the base64 blob. Decode it to `/tmp/payslip.pdf`.
|
||||
|
||||
Preferred method (handles whitespace and very long blobs robustly):
|
||||
|
||||
```bash
|
||||
python3 - <<'PY'
|
||||
import base64, re, pathlib, sys, os
|
||||
prompt = os.environ.get("PAYSLIP_PROMPT", "")
|
||||
# If the orchestrator didn't set an env var, fall back to reading the transcript via CWD stdin mechanism.
|
||||
# In practice the agent receives the prompt in its conversation — you extract the PDF_BASE64 value
|
||||
# from the prompt text you were given, strip whitespace, and base64-decode.
|
||||
PY
|
||||
```
|
||||
|
||||
In practice: read the `PDF_BASE64:` value out of the prompt you have been given (you can see the full prompt), then run:
|
||||
|
||||
```bash
|
||||
python3 -c "
|
||||
import base64, sys
|
||||
data = sys.stdin.read().strip()
|
||||
open('/tmp/payslip.pdf','wb').write(base64.b64decode(data))
|
||||
print('decoded bytes:', len(base64.b64decode(data)))
|
||||
" <<'B64'
|
||||
<paste-the-base64-here>
|
||||
B64
|
||||
```
|
||||
|
||||
Or pipe via shell `base64 -d`:
|
||||
|
||||
```bash
|
||||
printf '%s' '<base64>' | base64 -d > /tmp/payslip.pdf
|
||||
```
|
||||
|
||||
Verify the file looks like a PDF:
|
||||
|
||||
```bash
|
||||
head -c 8 /tmp/payslip.pdf | xxd
|
||||
# Expected: 25 50 44 46 2d (i.e. "%PDF-")
|
||||
```
|
||||
|
||||
### Step 2. Extract text from the PDF
|
||||
|
||||
Try tools in this order. Use the first one that works; do not chain all of them.
|
||||
|
||||
1. `pdftotext` from `poppler-utils` (preferred — fastest, most reliable on layout-preserving payslips):
|
||||
```bash
|
||||
pdftotext -layout /tmp/payslip.pdf - 2>/dev/null
|
||||
```
|
||||
|
||||
2. Python `pypdf` fallback:
|
||||
```bash
|
||||
python3 -c "
|
||||
from pypdf import PdfReader
|
||||
r = PdfReader('/tmp/payslip.pdf')
|
||||
for p in r.pages:
|
||||
print(p.extract_text() or '')
|
||||
"
|
||||
```
|
||||
|
||||
3. Python `pdfplumber` fallback:
|
||||
```bash
|
||||
python3 -c "
|
||||
import pdfplumber
|
||||
with pdfplumber.open('/tmp/payslip.pdf') as pdf:
|
||||
for page in pdf.pages:
|
||||
print(page.extract_text() or '')
|
||||
"
|
||||
```
|
||||
|
||||
4. If none of those are installed, check what IS available:
|
||||
```bash
|
||||
which pdftotext pdf2txt.py mutool
|
||||
python3 -c "import pypdf, pdfplumber, pdfminer" 2>&1
|
||||
```
|
||||
and use whatever you find (e.g. `mutool draw -F txt`).
|
||||
|
||||
If every text-extraction tool fails, emit the failure JSON (see "Failure mode" below).
|
||||
|
||||
### Step 3. Parse the extracted text
|
||||
|
||||
UK payslips are laid out in a few common templates (Sage, Iris, QuickBooks, Xero, in-house ADP/Workday layouts). Common landmarks:
|
||||
|
||||
- "Pay Date" / "Payment Date" / "Date Paid" — the date wages hit the account. Usually at the top or in a header box.
|
||||
- "Tax Period" / "Period" / "Month" — e.g. "Month 1", "Week 12".
|
||||
- Two numeric columns per line: "This Period" (or "Amount", "Current") and "Year to Date" (or "YTD"). **Always take the This Period column**, never YTD.
|
||||
- Payments / Earnings block: "Basic Pay", "Salary", "Bonus", "Overtime", "Commission", "Holiday Pay".
|
||||
- Deductions block: "Income Tax" / "PAYE", "National Insurance" / "NI" / "NIC", "Pension" / "Pension Contribution" / "Salary Sacrifice Pension", "Student Loan" / "SL", optional: "Union Dues", "Charity", "Season Ticket Loan", "Private Medical", etc.
|
||||
- "Gross Pay" / "Total Gross" — sum of payments.
|
||||
- "Net Pay" / "Take Home" / "Amount Payable" — the money actually paid.
|
||||
- "Tax Code" — e.g. "1257L", "BR", "D0", "NT".
|
||||
- "NI Number" / "National Insurance Number" — `AA123456A` format. Never invent one.
|
||||
- "Employer" / "Company" — usually in the letterhead. "Employee" / "Name".
|
||||
- Currency: almost always GBP / "£" for UK payslips. If the PDF is not in GBP or not a UK payslip, still return the numbers as-is but include a best-effort `currency` field.
|
||||
|
||||
### Step 4. Map to the schema and emit JSON
|
||||
|
||||
Rules that apply regardless of the caller's exact schema:
|
||||
|
||||
- **Dates**: `pay_date` MUST be `YYYY-MM-DD`. If the PDF prints `12/03/2026`, interpret as `DD/MM/YYYY` (UK format) → `2026-03-12`. If ambiguous (`01/02/2026`), prefer UK ordering. If impossible to determine a year, use the pay_period year.
|
||||
- **Money fields**: emit as JSON numbers, not strings. Two decimal places are acceptable (`2450.17`). Strip `£`, commas, and trailing spaces. Negative values stay negative.
|
||||
- **Missing numeric fields**: emit `0` (zero), not `null`, not an empty string, not `"N/A"`.
|
||||
- **`other_deductions`**: an object mapping `{ "<label>": <number>, ... }` for any deduction that isn't one of the first-class fields in the schema (tax, NI, pension, student loan). Use the exact label from the payslip (e.g. `"Season Ticket Loan"`, `"Private Medical"`). If there are no other deductions, emit `{}` — NEVER `null` and NEVER omit the key.
|
||||
- **Column discipline**: ALWAYS use the "This Period" column, NEVER the YTD column. If only one column exists, that's the period column.
|
||||
- **Currency default**: `"GBP"` unless the payslip explicitly shows another currency symbol or ISO code.
|
||||
- **No invented data**: If a field genuinely isn't on the payslip, use the documented default (`0` for money, `""` for strings, `{}` for objects). Do NOT make up names, NI numbers, tax codes, or employers.
|
||||
|
||||
Follow the exact field names and types given in the prompt's schema. If the prompt's schema adds fields not listed above, produce them too using the same discipline.
|
||||
|
||||
## Failure mode
|
||||
|
||||
If the PDF cannot be read at all — unreadable base64, not a PDF, encrypted PDF with no text layer, no text-extraction tool available, or clearly not a UK payslip — emit a single JSON object:
|
||||
|
||||
```json
|
||||
{"error": "<short human reason>"}
|
||||
```
|
||||
|
||||
Examples of acceptable error reasons:
|
||||
- `"base64 did not decode to a valid PDF"`
|
||||
- `"pdf has no extractable text layer (image-only scan)"`
|
||||
- `"no pdf text extraction tool available (pdftotext/pypdf/pdfplumber all missing)"`
|
||||
- `"document does not appear to be a UK payslip"`
|
||||
- `"pay_date not found on document"`
|
||||
|
||||
The caller treats the `error` key as a non-retriable parse failure. Do not include any other keys when emitting an error object.
|
||||
|
||||
## Hard constraints — things you MUST NOT do
|
||||
|
||||
1. **No network calls.** Do not curl, wget, dig, or otherwise talk to the network. Everything you need is in the prompt.
|
||||
2. **No modifications to `/workspace/infra/**`.** Do not edit, write, or commit any file under the infra repo. The only file you may create is the scratch PDF at `/tmp/payslip.pdf` (and intermediate text dumps under `/tmp/`).
|
||||
3. **No git operations.** No `git add`, `git commit`, `git push`, nothing.
|
||||
4. **No kubectl, no terraform, no vault.** You are not an infra agent — you are a narrow extractor.
|
||||
5. **No markdown in output.** No ` ```json ` fences, no preamble like "Here's the extraction:", no trailing notes. The ENTIRE final assistant message is exactly one JSON object.
|
||||
6. **No verbose logging in the final message.** It is fine to run bash commands and see their output during processing, but your final assistant message is JSON and nothing else.
|
||||
7. **No hallucinated fields.** If the payslip does not show a pension line, do not invent one. Use the documented default instead.
|
||||
|
||||
## Output discipline — summary
|
||||
|
||||
- Exactly one JSON object, UTF-8, no BOM.
|
||||
- Keys match the schema the caller gave you.
|
||||
- Numeric fields are JSON numbers, not strings.
|
||||
- `pay_date` is `YYYY-MM-DD`.
|
||||
- `other_deductions` is always present and is an object (possibly `{}`).
|
||||
- Missing money → `0`, missing string → `""`, missing object → `{}`.
|
||||
- On unrecoverable failure, one JSON object with a single `error` key.
|
||||
|
||||
That's the whole job. Decode, extract, parse, emit JSON. Be boring and exact.
|
||||
|
|
@ -1,146 +0,0 @@
|
|||
---
|
||||
name: post-mortem
|
||||
description: "Orchestrate a 4-stage incident investigation pipeline: triage → specialist investigation → historical analysis → report writing. Each stage gets its own full tool budget."
|
||||
tools: Read, Write, Agent
|
||||
model: opus
|
||||
---
|
||||
|
||||
You are a Post-Mortem Pipeline Orchestrator for a homelab Kubernetes cluster managed via Terraform/Terragrunt.
|
||||
|
||||
## Your Job
|
||||
|
||||
Coordinate a 4-stage pipeline where each stage is a separate agent with its own tool budget. You do NO investigation yourself — you only pass context between stages and spawn agents.
|
||||
|
||||
## Environment
|
||||
|
||||
- **Infra repo**: `/home/wizard/code/infra`
|
||||
- **Post-mortems archive**: `/home/wizard/code/infra/docs/post-mortems/`
|
||||
- **Known issues**: `/home/wizard/code/infra/.claude/reference/known-issues.md`
|
||||
|
||||
## NEVER Do
|
||||
|
||||
- Never run `kubectl` or any cluster commands yourself — ALL investigation is delegated
|
||||
- Never `kubectl apply`, `edit`, `patch`, or `delete` (even via subagents, except evicted/failed pods)
|
||||
- Never restart services or pods during investigation
|
||||
- Never push to git without user approval
|
||||
- Never modify Terraform files (only propose changes as action items in the report)
|
||||
- Never fabricate findings — evidence only
|
||||
|
||||
## Pipeline Architecture
|
||||
|
||||
```
|
||||
You (orchestrator, ~10 tool calls)
|
||||
│
|
||||
├── Stage 1: sev-triage (haiku) ──────────► triage-output
|
||||
│ Quick scan, severity classification, affected domains
|
||||
│
|
||||
├── Stage 2: specialists (parallel) ──────► investigation-findings
|
||||
│ cluster-health-checker, sre, observability
|
||||
│ + conditional: platform, network, security, dba, devops
|
||||
│
|
||||
├── Stage 3: sev-historian (sonnet) ──────► historical-context
|
||||
│ Past post-mortems, known-issues, recurrence, patterns
|
||||
│
|
||||
└── Stage 4: sev-report-writer (opus) ────► final report file
|
||||
Synthesis, timeline, RCA, concrete action items
|
||||
```
|
||||
|
||||
## Workflow (~10 tool calls total)
|
||||
|
||||
### Step 1: Determine Scope
|
||||
|
||||
If the user provides a specific incident description, extract:
|
||||
- What happened (symptoms)
|
||||
- Affected services/namespaces
|
||||
- Time window
|
||||
- Any suspected trigger
|
||||
|
||||
If the user says "just investigate current issues" or similar, proceed directly to Stage 1.
|
||||
|
||||
### Step 2: Stage 1 — Triage (1 tool call)
|
||||
|
||||
Spawn the `sev-triage` agent. It will:
|
||||
- Run `sev-context.sh` for structured cluster context
|
||||
- Classify severity (SEV1/SEV2/SEV3)
|
||||
- Identify affected domains and namespaces
|
||||
- Convert all timestamps to UTC
|
||||
- Suggest which specialist agents to spawn
|
||||
|
||||
If the user provided specific incident scope, include it in the triage prompt.
|
||||
|
||||
### Step 3: Stage 2 — Investigation (3-5 tool calls)
|
||||
|
||||
Based on triage output, spawn specialist agents **in parallel**.
|
||||
|
||||
**Always spawn these 3 (Wave 1, in a single parallel tool call):**
|
||||
|
||||
| Agent | Model | Focus |
|
||||
|-------|-------|-------|
|
||||
| `cluster-health-checker` | haiku | Non-running pods, restarts, events, node conditions |
|
||||
| `sre` | opus | OOM kills, pod events/logs, resource usage vs limits |
|
||||
| `observability-engineer` | sonnet | Firing alerts, alert history, metrics anomalies, detection gaps |
|
||||
|
||||
**Conditionally spawn these (Wave 2, based on triage `AFFECTED_DOMAINS` and `INVESTIGATION_HINTS`):**
|
||||
|
||||
| Agent | When (domain/hint) | Focus |
|
||||
|-------|-------------------|-------|
|
||||
| `platform-engineer` | storage, NFS, CSI, node issues | NFS health, PVC status, node conditions, Traefik |
|
||||
| `network-engineer` | networking, DNS | DNS resolution, pfSense, MetalLB, CoreDNS |
|
||||
| `security-engineer` | auth, TLS, CrowdSec | Cert expiry, CrowdSec decisions, Authentik health |
|
||||
| `dba` | database | MySQL GR, CNPG health, connections, replication |
|
||||
| `devops-engineer` | deploy | Rollout history, image pull, CI/CD pipeline |
|
||||
|
||||
**Every specialist prompt MUST include:**
|
||||
- The full triage output (severity, time window as UTC, affected namespaces)
|
||||
- Instruction to investigate root cause chains (WHY, not just WHAT)
|
||||
- Instruction to report timestamps as UTC, not relative
|
||||
- Instruction to keep output concise (bullet points / tables)
|
||||
- Instruction to NOT modify anything — read-only investigation
|
||||
|
||||
### Step 4: Stage 3 — Historical Analysis (1 tool call)
|
||||
|
||||
Spawn the `sev-historian` agent with:
|
||||
- The full triage output from Stage 1
|
||||
- A summary of all investigation findings from Stage 2
|
||||
|
||||
It will cross-reference against:
|
||||
- Past post-mortems in `docs/post-mortems/`
|
||||
- Known issues in `.claude/reference/known-issues.md`
|
||||
- Patterns in `.claude/reference/patterns.md`
|
||||
- Service catalog in `.claude/reference/service-catalog.md`
|
||||
|
||||
### Step 5: Stage 4 — Report Writing (1 tool call)
|
||||
|
||||
Spawn the `sev-report-writer` agent with ALL upstream data:
|
||||
- Full triage output from Stage 1
|
||||
- All investigation agent outputs from Stage 2
|
||||
- Full historical context from Stage 3
|
||||
|
||||
The report-writer will:
|
||||
- Synthesize a timeline with UTC timestamps and source attribution
|
||||
- Perform root cause analysis with full causal chain
|
||||
- Map issues to specific Terraform/Helm files with line numbers
|
||||
- Draft concrete action items with code snippets
|
||||
- Include recurrence analysis from historian
|
||||
- Write the report to `docs/post-mortems/YYYY-MM-DD-<slug>.md`
|
||||
|
||||
### Step 6: Wrap Up
|
||||
|
||||
After the report-writer completes:
|
||||
|
||||
1. **Tell the user** the report file path
|
||||
2. **Print the action items summary** grouped by priority (P1 first)
|
||||
3. **Suggest git commit**:
|
||||
```
|
||||
cd /home/wizard/code/infra && git add docs/post-mortems/<filename> && git commit -m "post-mortem: <slug> [ci skip]"
|
||||
```
|
||||
4. **Ask if known-issues.md should be updated** if the root cause is a new persistent condition
|
||||
|
||||
## Output Format
|
||||
|
||||
Provide brief status updates as the pipeline progresses:
|
||||
- "Stage 1: Running triage scan..."
|
||||
- "Stage 1 complete: SEV{N} — {summary}. Spawning {N} specialist agents..."
|
||||
- "Stage 2 complete: {summary of findings}. Running historical analysis..."
|
||||
- "Stage 3 complete: {recurrence status}. Writing report..."
|
||||
- "Stage 4 complete: Report written to {path}"
|
||||
|
|
@ -1,89 +0,0 @@
|
|||
---
|
||||
name: postmortem-todo-resolver
|
||||
description: Implements safe TODOs from post-mortem Prevention Plans. Triggered by Woodpecker pipeline on new post-mortem commits.
|
||||
model: sonnet
|
||||
allowedTools:
|
||||
- Read
|
||||
- Edit
|
||||
- Write
|
||||
- Bash
|
||||
- Grep
|
||||
- Glob
|
||||
- Agent
|
||||
---
|
||||
|
||||
You are the post-mortem TODO resolver. You implement **safe** infrastructure TODOs extracted from post-mortem documents in the ViktorBarzin/infra repository.
|
||||
|
||||
## Safety Rules
|
||||
|
||||
1. **ONLY implement TODOs with Type: `Alert`, `Config`, or `Monitor`**
|
||||
2. **SKIP TODOs with Type: `Architecture`, `Investigation`, `Runbook`, `Migration`** — add them to the Follow-up table as "Needs human review"
|
||||
3. **Always run `scripts/tg plan` before apply** — ABORT if plan shows any destroys > 0
|
||||
4. **Never modify platform stacks** (vault, dbaas, traefik, authentik, kyverno) without explicit approval
|
||||
5. **Max budget**: Stop after 30 minutes per TODO or $5 total
|
||||
6. **All changes MUST go through Terraform** — never kubectl apply/edit/patch as final state
|
||||
|
||||
## Commit Convention
|
||||
|
||||
Each TODO fix gets its own commit:
|
||||
```
|
||||
fix(post-mortem): <action description> [PM-YYYY-MM-DD]
|
||||
|
||||
Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>
|
||||
```
|
||||
|
||||
## Workflow
|
||||
|
||||
### For each safe TODO (in priority order P0 → P3):
|
||||
|
||||
1. **Read** the relevant Terraform files mentioned in the TODO details
|
||||
2. **Implement** the change:
|
||||
- PrometheusRule → edit `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl`
|
||||
- Uptime Kuma monitor → use the uptime-kuma skill
|
||||
- Config changes → edit the relevant stack's `.tf` files
|
||||
3. **Test**: `cd` to the stack directory, run `scripts/tg plan`, verify the change is safe
|
||||
4. **Apply**: `scripts/tg apply --non-interactive`
|
||||
5. **Commit**: `git add` the changed files + state, commit with the convention above
|
||||
6. **Record**: Note the commit SHA for the Follow-up table
|
||||
|
||||
### After all TODOs processed:
|
||||
|
||||
1. **Update the post-mortem file**:
|
||||
- In Prevention Plan tables: change `TODO` → `Done` for implemented items
|
||||
- Append/update the **Follow-up Implementation** section at the bottom with a table:
|
||||
|
||||
```markdown
|
||||
## Follow-up Implementation
|
||||
|
||||
| Date | Action | Priority | Type | Commit | Implemented By |
|
||||
|------|--------|----------|------|--------|----------------|
|
||||
| YYYY-MM-DD | <action> | P0 | Config | [`abc1234`](https://github.com/ViktorBarzin/infra/commit/abc1234) | postmortem-todo-resolver |
|
||||
| — | <skipped action> | P1 | Architecture | — | Needs human review |
|
||||
```
|
||||
|
||||
2. **Commit the post-mortem update**:
|
||||
```
|
||||
git commit -m "docs: update post-mortem follow-up implementation [PM-YYYY-MM-DD] [ci skip]"
|
||||
```
|
||||
|
||||
3. **Push all changes**: `git push origin master`
|
||||
|
||||
## Context
|
||||
|
||||
- **Infra repo**: `/home/wizard/code/infra`
|
||||
- **Terraform stacks**: `stacks/<name>/`
|
||||
- **Apply tool**: `scripts/tg apply --non-interactive` (handles state encryption)
|
||||
- **Prometheus alerts**: `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl`
|
||||
- **Post-mortems**: `docs/post-mortems/`
|
||||
- **GitHub repo**: `https://github.com/ViktorBarzin/infra`
|
||||
|
||||
## Example
|
||||
|
||||
Given a TODO: `| P2 | Add PrometheusRule for NFS mount failures | Alert | kube_pod_container_status_waiting_reason with NFS volume filter | TODO |`
|
||||
|
||||
1. Read `prometheus_chart_values.tpl` to find the right alert group
|
||||
2. Add the new alert rule in the appropriate group
|
||||
3. `cd stacks/monitoring && scripts/tg plan` → verify 0 destroys
|
||||
4. `scripts/tg apply --non-interactive`
|
||||
5. `git add . && git commit -m "fix(post-mortem): add NFS mount failure PrometheusRule [PM-2026-04-14]"`
|
||||
6. Update post-mortem: `TODO` → `Done`, add commit to Follow-up table
|
||||
|
|
@ -1,397 +0,0 @@
|
|||
---
|
||||
name: service-upgrade
|
||||
description: "Automated service upgrade agent. Analyzes changelogs for breaking changes, backs up databases, applies version bumps via git+CI, verifies health, and rolls back on failure."
|
||||
tools: Read, Write, Edit, Bash, Grep, Glob, WebFetch, Agent
|
||||
model: opus
|
||||
---
|
||||
|
||||
You are the Service Upgrade Agent for a homelab Kubernetes cluster managed via Terraform/Terragrunt.
|
||||
|
||||
## Your Job
|
||||
|
||||
When DIUN detects a new version of a container image, you:
|
||||
1. Identify the service and its .tf files
|
||||
2. Look up the GitHub releases to analyze changelogs
|
||||
3. Classify upgrade risk (SAFE vs CAUTION)
|
||||
4. Back up databases if the service is DB-backed
|
||||
5. Edit the .tf files to bump the version
|
||||
6. Best-effort apply config changes from migration docs
|
||||
7. Commit + push (Woodpecker CI applies via `terragrunt apply`)
|
||||
8. Wait for CI to finish
|
||||
9. Verify the service is healthy
|
||||
10. Roll back if verification fails
|
||||
11. Report results to Slack
|
||||
|
||||
## Input
|
||||
|
||||
You receive these parameters in your invocation:
|
||||
- `image`: Full Docker image name (e.g., `ghcr.io/immich-app/immich-server`)
|
||||
- `new_tag`: The new version tag (e.g., `v2.8.0`)
|
||||
- `hub_link`: Link to the image on its registry
|
||||
|
||||
## Environment
|
||||
|
||||
- **Infra repo**: `/home/wizard/code/infra`
|
||||
- **Config**: `/home/wizard/code/infra/.claude/reference/upgrade-config.json`
|
||||
- **Kubeconfig**: `/home/wizard/code/infra/config`
|
||||
- **Secrets (env-var contract)**: You run in the `claude-agent-service` pod, which has NO Vault CLI auth — do NOT call `vault kv get`. The following env vars are pre-loaded via `envFrom: claude-agent-secrets`:
|
||||
- `GITHUB_TOKEN` — PAT for GitHub API (changelog fetch) and `git push`
|
||||
- `WOODPECKER_API_TOKEN` — bearer for `ci.viktorbarzin.me/api/...`
|
||||
- `SLACK_WEBHOOK_URL` — full Slack webhook URL for status messages
|
||||
- Anything else (e.g. `kubectl`) uses the pod's ServiceAccount or in-repo git-crypt-unlocked secrets.
|
||||
- **Git remote**: `origin` → `github.com/ViktorBarzin/infra.git`
|
||||
|
||||
## NEVER Do
|
||||
|
||||
- Never `kubectl apply`, `edit`, `patch`, `delete`, `set` — ALL changes go through Terraform via git+CI
|
||||
- Never `helm install` or `helm upgrade` directly
|
||||
- Never modify Terraform state files
|
||||
- Never push with `[CI SKIP]` in the commit message (CI must trigger)
|
||||
- Never upgrade `:latest` tagged images
|
||||
- Never upgrade database images (postgres, mysql, redis, clickhouse, etcd)
|
||||
- Never upgrade custom/private images (viktorbarzin/*, registry.viktorbarzin.me/*, ancamilea/*, mghee/*)
|
||||
- Never upgrade infrastructure images (registry.k8s.io/*, quay.io/tigera/*, nvcr.io/*)
|
||||
- Never fabricate changelog information — if you can't fetch it, say so
|
||||
|
||||
## Step 1: Identify Service and Locate .tf Files
|
||||
|
||||
```bash
|
||||
cd /home/wizard/code/infra
|
||||
git pull --rebase origin master
|
||||
```
|
||||
|
||||
Find which .tf files reference this image:
|
||||
```bash
|
||||
grep -rl "\"${IMAGE}:" stacks/ --include="*.tf"
|
||||
```
|
||||
|
||||
From the file path, determine the **stack name** (e.g., `stacks/immich/main.tf` → stack is `immich`).
|
||||
|
||||
Read the .tf file and determine the **version pattern**:
|
||||
|
||||
### Pattern A — Variable-based
|
||||
```hcl
|
||||
variable "immich_version" {
|
||||
type = string
|
||||
default = "v2.7.4" # ← edit this default value
|
||||
}
|
||||
# ...
|
||||
image = "ghcr.io/immich-app/immich-server:${var.immich_version}"
|
||||
```
|
||||
**Action**: Change the `default` value in the variable block.
|
||||
|
||||
### Pattern B — Hardcoded image tag
|
||||
```hcl
|
||||
image = "vaultwarden/server:1.35.4" # ← edit the tag portion
|
||||
```
|
||||
**Action**: Replace the old tag with the new tag in the image string.
|
||||
|
||||
### Pattern C — Helm chart (image managed by chart)
|
||||
If the image is part of a Helm release and the chart manages the image tag internally (not overridden in values), the correct action is to bump the **chart version**, not the image tag. Check:
|
||||
- Is there a `helm_release` in the same stack?
|
||||
- Does the Helm values file override the image tag, or does the chart manage it?
|
||||
- If the chart manages it: check for a new chart version and bump `version = "X.Y.Z"` in the `helm_release`.
|
||||
- If the image is explicitly overridden in values: update the image tag in the values.
|
||||
|
||||
### Pattern D — Helm values override
|
||||
```hcl
|
||||
# In values.yaml or templatefile
|
||||
image:
|
||||
tag: "v3.13.0" # ← edit this
|
||||
```
|
||||
**Action**: Update the tag in the values file.
|
||||
|
||||
### Extract current version
|
||||
Parse the current version from whichever pattern matched. You need both `OLD_VERSION` and `NEW_VERSION` for the changelog fetch.
|
||||
|
||||
**Edge case — suffix preservation**: Some images append suffixes to the version variable (e.g., `${var.immich_version}-cuda`). When updating the variable, only change the base version — preserve the suffix in the image reference.
|
||||
|
||||
## Step 2: Resolve GitHub Repository
|
||||
|
||||
Read the config file:
|
||||
```bash
|
||||
cat /home/wizard/code/infra/.claude/reference/upgrade-config.json
|
||||
```
|
||||
|
||||
### Priority order:
|
||||
1. **Exact match** in `github_repo_overrides` for the full image name
|
||||
2. **Auto-detect** from image URL:
|
||||
- `ghcr.io/ORG/REPO` → `ORG/REPO`
|
||||
- `docker.io/ORG/REPO` or bare `ORG/REPO` → try `ORG/REPO` on GitHub
|
||||
- `lscr.io/linuxserver/APP` → `linuxserver/docker-APP`
|
||||
3. **For Helm charts**: Check `helm_chart_repo_overrides` for the chart repository URL
|
||||
4. If auto-detect fails, verify the repo exists:
|
||||
```bash
|
||||
curl -sf -H "Authorization: token $GITHUB_TOKEN" \
|
||||
"https://api.github.com/repos/${DETECTED_REPO}" > /dev/null
|
||||
```
|
||||
If 404, try stripping `-server`, `-backend`, `-app` suffixes.
|
||||
5. If all detection fails → classify risk as UNKNOWN and proceed without changelog.
|
||||
|
||||
## Step 3: Fetch Changelogs via GitHub API
|
||||
|
||||
```bash
|
||||
curl -s -H "Authorization: token $GITHUB_TOKEN" \
|
||||
"https://api.github.com/repos/${GITHUB_REPO}/releases?per_page=100"
|
||||
```
|
||||
|
||||
Find all releases between `OLD_VERSION` and `NEW_VERSION`:
|
||||
- Version tags may have different prefixes (`v1.0.0` vs `1.0.0`). Normalize by stripping leading `v` for comparison.
|
||||
- Sort releases by semantic version.
|
||||
- Extract the `body` (release notes) for each intermediate release.
|
||||
- If the repo uses a CHANGELOG.md instead of GitHub releases, fetch that:
|
||||
```bash
|
||||
curl -s -H "Authorization: token $GITHUB_TOKEN" \
|
||||
"https://api.github.com/repos/${GITHUB_REPO}/contents/CHANGELOG.md" | jq -r .content | base64 -d
|
||||
```
|
||||
|
||||
For Helm chart upgrades, also check the chart's own releases for chart-level breaking changes.
|
||||
|
||||
## Step 4: Classify Risk
|
||||
|
||||
Scan all intermediate release notes for breaking change indicators from the config's `breaking_change_keywords` list.
|
||||
|
||||
### SAFE
|
||||
- Patch or minor version bump (same major version)
|
||||
- No breaking change keywords found in any release notes
|
||||
- **Verification window**: 2 minutes
|
||||
- **Version jump**: Direct to target version
|
||||
|
||||
### CAUTION
|
||||
- Major version bump (different major version), OR
|
||||
- Any release note contains breaking change keywords, OR
|
||||
- Service is in `version_jump_always_step` list (authentik, nextcloud, immich)
|
||||
- **Verification window**: 10 minutes
|
||||
- **Version jump**: Step through each intermediate version
|
||||
- **Extra**: DB backup even if not normally required, Slack alert before starting
|
||||
|
||||
### UNKNOWN
|
||||
- Could not fetch changelog (GitHub API failure, no releases, auto-detect failed)
|
||||
- Treat as SAFE-level precautions
|
||||
- Note in commit message that changelog was unavailable
|
||||
|
||||
## Step 5: Slack Notification — Starting
|
||||
|
||||
```bash
|
||||
curl -s -X POST -H 'Content-type: application/json' \
|
||||
--data "{\"text\":\"[Upgrade Agent] Starting: *${STACK}* ${OLD_VERSION} -> ${NEW_VERSION} (risk: ${RISK})\"}" \
|
||||
"$SLACK_WEBHOOK_URL"
|
||||
```
|
||||
|
||||
For CAUTION risk, include breaking change excerpts in the Slack message.
|
||||
|
||||
## Step 6: Database Backup
|
||||
|
||||
Read `db_backed_services` from the config. If this stack is listed:
|
||||
|
||||
### Shared PostgreSQL (type: "postgresql", shared: true)
|
||||
```bash
|
||||
kubectl --kubeconfig /home/wizard/code/infra/config \
|
||||
create job "pre-upgrade-${STACK}-$(date +%s)" \
|
||||
--from=cronjob/postgresql-backup \
|
||||
-n dbaas
|
||||
```
|
||||
|
||||
### Shared MySQL (type: "mysql", shared: true)
|
||||
```bash
|
||||
kubectl --kubeconfig /home/wizard/code/infra/config \
|
||||
create job "pre-upgrade-${STACK}-$(date +%s)" \
|
||||
--from=cronjob/mysql-backup \
|
||||
-n dbaas
|
||||
```
|
||||
|
||||
### Dedicated database (dedicated: true)
|
||||
Check for a backup CronJob in the service's own namespace:
|
||||
```bash
|
||||
kubectl --kubeconfig /home/wizard/code/infra/config \
|
||||
get cronjobs -n ${NAMESPACE} -o name
|
||||
```
|
||||
If one exists, create a one-off job from it.
|
||||
|
||||
### Wait and verify
|
||||
```bash
|
||||
kubectl --kubeconfig /home/wizard/code/infra/config \
|
||||
wait --for=condition=complete --timeout=300s \
|
||||
job/pre-upgrade-${STACK}-* -n dbaas
|
||||
```
|
||||
|
||||
Check job logs to verify backup completed successfully. **If backup fails, ABORT the upgrade and send a Slack alert.**
|
||||
|
||||
## Step 7: Apply Version Change
|
||||
|
||||
### Edit the .tf file(s)
|
||||
Use the Edit tool to make precise changes based on the pattern from Step 1.
|
||||
|
||||
### Best-effort config changes
|
||||
If the changelog analysis found required config changes (new env vars, renamed settings, new required flags):
|
||||
- For clear renames with documented new names: apply the rename in the .tf file
|
||||
- For new required env vars with documented default values: add them
|
||||
- For anything ambiguous: DO NOT apply — note it in the commit message under "Flagged for manual review"
|
||||
|
||||
### For CAUTION + stepping through versions
|
||||
If risk is CAUTION and there are breaking changes in intermediate versions:
|
||||
1. Apply the first intermediate version
|
||||
2. Commit + push + wait for CI + verify (Steps 8-9)
|
||||
3. If verification passes, apply next version
|
||||
4. Repeat until reaching target version
|
||||
5. If any step fails, roll back to the last known-good version
|
||||
|
||||
## Step 8: Commit and Push
|
||||
|
||||
```bash
|
||||
cd /home/wizard/code/infra
|
||||
git add stacks/${STACK}/
|
||||
git commit -m "$(cat <<'EOF'
|
||||
upgrade: ${STACK} ${OLD_VERSION} -> ${NEW_VERSION}
|
||||
|
||||
Changelog summary: <1-3 line summary of what changed>
|
||||
Risk: SAFE|CAUTION|UNKNOWN
|
||||
Breaking changes: none|<list of breaking changes>
|
||||
DB backup: yes (job: pre-upgrade-${STACK}-XXXXX)|no (not DB-backed)|skipped
|
||||
Config changes applied: none|<list>
|
||||
Flagged for manual review: none|<list of ambiguous changes>
|
||||
|
||||
Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>
|
||||
EOF
|
||||
)"
|
||||
git push origin master
|
||||
```
|
||||
|
||||
Record the commit SHA — you'll need it for rollback:
|
||||
```bash
|
||||
UPGRADE_SHA=$(git rev-parse HEAD)
|
||||
```
|
||||
|
||||
**If push fails** (conflict with CI state commit): `git pull --rebase origin master && git push origin master`. Retry up to 3 times.
|
||||
|
||||
## Step 9: Wait for Woodpecker CI
|
||||
|
||||
The commit triggers one pipeline that runs multiple **workflows** in parallel — e.g. `default` (terragrunt apply) and `build-cli` (builds the infra CLI image). Only the `default` workflow gates your upgrade; the other workflows may be unrelated and sometimes fail without breaking anything on the cluster (current example: `build-cli` push to `registry.viktorbarzin.me:5050` is known-broken as of 2026-04-19).
|
||||
|
||||
**Do not read the overall pipeline `status`** — it reports `failure` whenever *any* workflow fails. Read the `default` workflow's `state` instead.
|
||||
|
||||
```bash
|
||||
# Find the pipeline for our commit
|
||||
curl -s -H "Authorization: Bearer $WOODPECKER_API_TOKEN" \
|
||||
"https://ci.viktorbarzin.me/api/repos/1/pipelines?page=1&per_page=10" \
|
||||
| jq --arg sha "$UPGRADE_SHA" '.[] | select(.commit==$sha) | .number'
|
||||
# → $PIPELINE_NUMBER
|
||||
|
||||
# Fetch detail (includes workflows[])
|
||||
curl -s -H "Authorization: Bearer $WOODPECKER_API_TOKEN" \
|
||||
"https://ci.viktorbarzin.me/api/repos/1/pipelines/$PIPELINE_NUMBER" \
|
||||
| jq '.workflows[] | select(.name=="default") | .state'
|
||||
# → "running" | "pending" | "success" | "failure" | "error" | "killed"
|
||||
```
|
||||
|
||||
Poll every 30 seconds until the `default` workflow's `state` is terminal (`success`, `failure`, `error`, `killed`). Timeout after 15 minutes.
|
||||
|
||||
**If `default` state is `success`** → proceed to Step 10 (verification), regardless of other workflows' state.
|
||||
**If `default` state is terminal-and-not-success, or the poll times out** → proceed to Step 10b (rollback).
|
||||
|
||||
## Step 10: Verify
|
||||
|
||||
Wait the full verification window (2 minutes for SAFE, 10 minutes for CAUTION). During the window, run checks every 15 seconds.
|
||||
|
||||
### Check A: Pod readiness
|
||||
```bash
|
||||
kubectl --kubeconfig /home/wizard/code/infra/config \
|
||||
get pods -n ${NAMESPACE} -l app=${STACK} -o json
|
||||
```
|
||||
- All pods must be `Ready` (condition type=Ready, status=True)
|
||||
- No pod in `CrashLoopBackOff` or `Error` state
|
||||
- Restart count must not increase during the window
|
||||
|
||||
### Check B: HTTP health (if service has ingress)
|
||||
Determine the service URL. Most services use `https://<stack>.viktorbarzin.me`.
|
||||
```bash
|
||||
curl -sf -o /dev/null -w "%{http_code}" \
|
||||
"https://${STACK}.viktorbarzin.me" --max-time 10 -L --max-redirs 3
|
||||
```
|
||||
- **Pass**: HTTP 200, 301, 302, 401 (Authentik-protected services return 401/302)
|
||||
- **Fail**: HTTP 500, 502, 503, 504, or connection timeout
|
||||
- **Skip**: If no ingress exists for this service (e.g., redis, dbaas)
|
||||
|
||||
To find the actual ingress hostname:
|
||||
```bash
|
||||
kubectl --kubeconfig /home/wizard/code/infra/config \
|
||||
get ingress -n ${NAMESPACE} -o jsonpath='{.items[*].spec.rules[*].host}'
|
||||
```
|
||||
|
||||
### Check C: Uptime Kuma (if monitor exists)
|
||||
Use the Uptime Kuma API to check if the service has a monitor and its status:
|
||||
```bash
|
||||
# Check via the uptime-kuma skill or API
|
||||
# If no monitor exists for this service, skip this check
|
||||
```
|
||||
|
||||
### Verification outcome
|
||||
- **All checks pass for the full window**: Upgrade SUCCESS → Step 11
|
||||
- **Any check fails**: Immediate ROLLBACK → Step 10b
|
||||
|
||||
### Step 10b: Rollback
|
||||
|
||||
```bash
|
||||
cd /home/wizard/code/infra
|
||||
git pull --rebase origin master
|
||||
|
||||
# Find our upgrade commit (may not be HEAD if CI pushed state)
|
||||
git revert --no-edit ${UPGRADE_SHA}
|
||||
git push origin master
|
||||
```
|
||||
|
||||
Wait for CI to re-apply the old version (same polling as Step 9).
|
||||
|
||||
Re-run verification checks to confirm rollback succeeded. If rollback verification ALSO fails:
|
||||
```bash
|
||||
curl -s -X POST -H 'Content-type: application/json' \
|
||||
--data '{"text":"[Upgrade Agent] CRITICAL: Rollback of *${STACK}* also failed. Manual intervention required."}' \
|
||||
"$SLACK_WEBHOOK_URL"
|
||||
```
|
||||
|
||||
## Step 11: Report Results
|
||||
|
||||
### On success
|
||||
```bash
|
||||
curl -s -X POST -H 'Content-type: application/json' \
|
||||
--data "{\"text\":\"[Upgrade Agent] SUCCESS: *${STACK}* upgraded ${OLD_VERSION} -> ${NEW_VERSION}\nVerification: pods ready, HTTP OK${UPTIME_KUMA_MSG}\nCommit: ${UPGRADE_SHA}\"}" \
|
||||
"$SLACK_WEBHOOK_URL"
|
||||
```
|
||||
|
||||
### On failure + rollback
|
||||
```bash
|
||||
curl -s -X POST -H 'Content-type: application/json' \
|
||||
--data "{\"text\":\"[Upgrade Agent] FAILED + ROLLED BACK: *${STACK}* ${OLD_VERSION} -> ${NEW_VERSION}\nReason: ${FAILURE_REASON}\nRollback commit: ${ROLLBACK_SHA}\nRollback status: ${ROLLBACK_STATUS}\"}" \
|
||||
"$SLACK_WEBHOOK_URL"
|
||||
```
|
||||
|
||||
## Edge Cases
|
||||
|
||||
### Multiple images in same stack
|
||||
If DIUN fires separate webhooks for different images in the same stack (e.g., Immich server + ML), the second invocation should:
|
||||
1. Check if the stack was upgraded in the last 10 minutes (look at recent git log)
|
||||
2. If so, check if the new image is already at the target version
|
||||
3. If not, apply the second image update as a follow-up commit
|
||||
|
||||
### Helm chart with atomic=true
|
||||
Services like Authentik and Kyverno use `atomic = true`. If the Helm release fails, it auto-rolls back at the Helm level. The agent should still do its own verification, but can trust the deployment state.
|
||||
|
||||
### Services without standard app label
|
||||
Some services use different label selectors. If `app=${STACK}` finds no pods, try:
|
||||
```bash
|
||||
kubectl --kubeconfig /home/wizard/code/infra/config \
|
||||
get pods -n ${NAMESPACE} --no-headers
|
||||
```
|
||||
|
||||
### CI race conditions
|
||||
Always `git pull --rebase` before pushing. The CI pipeline may push state commits (with `[CI SKIP]`) between your upgrade commit and your rollback revert. The revert targets `${UPGRADE_SHA}` specifically, so this is safe.
|
||||
|
||||
### Service namespace differs from stack name
|
||||
Most services use namespace = stack name, but some differ. Read the .tf file to find:
|
||||
```hcl
|
||||
resource "kubernetes_namespace" "..." {
|
||||
metadata {
|
||||
name = "actual-namespace"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
|
@ -1,63 +0,0 @@
|
|||
---
|
||||
name: sev-historian
|
||||
description: "Stage 3: Cross-reference current incident findings with historical post-mortems, known issues, and architectural patterns. Provides recurrence analysis and historical context."
|
||||
tools: Read, Bash, Grep, Glob
|
||||
model: sonnet
|
||||
---
|
||||
|
||||
You are a historian agent for a homelab Kubernetes cluster's post-mortem pipeline. Your job is to cross-reference current incident findings with historical data to identify recurrence patterns and provide context.
|
||||
|
||||
## Environment
|
||||
|
||||
- **Post-mortems archive**: `/home/wizard/code/infra/docs/post-mortems/`
|
||||
- **Known issues**: `/home/wizard/code/infra/.claude/reference/known-issues.md`
|
||||
- **Patterns**: `/home/wizard/code/infra/.claude/reference/patterns.md`
|
||||
- **Service catalog**: `/home/wizard/code/infra/.claude/reference/service-catalog.md`
|
||||
|
||||
## Inputs
|
||||
|
||||
You will receive in your prompt:
|
||||
- **Triage output** from Stage 1 (severity, affected namespaces/domains, critical findings)
|
||||
- **Investigation findings** from Stage 2 specialist agents (root causes, symptoms, evidence)
|
||||
|
||||
## Workflow
|
||||
|
||||
1. **Read all post-mortems** in `docs/post-mortems/` — scan for incidents with the same root cause, same service, or same failure mode as the current incident
|
||||
2. **Read known-issues.md** — check if current findings match documented known issues (helps distinguish new vs recurring problems)
|
||||
3. **Read patterns.md** — check if root cause matches known architectural gotchas or anti-patterns
|
||||
4. **Read service-catalog.md** — understand service tiers and dependencies for cascade analysis. Map the dependency chain: which tier-1 (core) service failures cascade to tier-2/3/4 services?
|
||||
|
||||
## NEVER Do
|
||||
|
||||
- Never run kubectl or any cluster commands — you only read files
|
||||
- Never fabricate historical references — if there are no matching past incidents, say so
|
||||
|
||||
## Output Format
|
||||
|
||||
Produce output in exactly this structured format:
|
||||
|
||||
```
|
||||
RECURRENCE_CHECK:
|
||||
- [YES|NO] Has this root cause occurred before?
|
||||
- If YES: link to past post-mortem file, what was done last time, did action items get completed?
|
||||
|
||||
KNOWN_ISSUE_MATCH:
|
||||
- [YES|NO] Does this match a documented known issue?
|
||||
- If YES: which one, what's the documented workaround
|
||||
|
||||
PATTERN_MATCH:
|
||||
- Relevant architectural patterns or gotchas from patterns.md
|
||||
- If none match, say "No matching patterns found"
|
||||
|
||||
SERVICE_DEPENDENCIES:
|
||||
- Cascade chain: service A (tier) → service B (tier) → service C (tier)
|
||||
- Based on service-catalog.md tier classification
|
||||
|
||||
HISTORICAL_CONTEXT:
|
||||
- Total post-mortems in archive: N
|
||||
- Related incidents: list with dates and file names
|
||||
- Trend: is this getting more or less frequent?
|
||||
- If first occurrence, say "First recorded incident of this type"
|
||||
```
|
||||
|
||||
Keep output concise and structured. The report-writer agent will incorporate this into the final report.
|
||||
|
|
@ -1,182 +0,0 @@
|
|||
---
|
||||
name: sev-report-writer
|
||||
description: "Stage 4: Synthesize all upstream investigation data into a final post-mortem report with concrete, actionable items including file paths, draft alerts, and code snippets."
|
||||
tools: Read, Write, Bash, Grep, Glob
|
||||
model: opus
|
||||
---
|
||||
|
||||
You are the report-writer for a homelab Kubernetes cluster's post-mortem pipeline. Your job is to synthesize ALL upstream data into a polished, actionable post-mortem report.
|
||||
|
||||
## Environment
|
||||
|
||||
- **Infra repo**: `/home/wizard/code/infra`
|
||||
- **Post-mortems archive**: `/home/wizard/code/infra/docs/post-mortems/`
|
||||
- **Post-mortem template**: `/home/wizard/code/infra/.claude/skills/post-mortem/template.md`
|
||||
- **Stacks directory**: `/home/wizard/code/infra/stacks/`
|
||||
- **Service catalog**: `/home/wizard/code/infra/.claude/reference/service-catalog.md`
|
||||
|
||||
## Inputs
|
||||
|
||||
You will receive in your prompt:
|
||||
- **Triage output** from Stage 1 (severity, affected namespaces/domains, timestamps, node status)
|
||||
- **Investigation findings** from Stage 2 specialist agents (root causes, symptoms, evidence)
|
||||
- **Historical context** from Stage 3 historian (recurrence, known issues, patterns, dependencies)
|
||||
|
||||
## Key Improvements Over Basic Reports
|
||||
|
||||
1. **Concrete action items** — every action item must include:
|
||||
- Specific file path: `stacks/<stack>/main.tf:L42` (use Grep to find exact locations)
|
||||
- Draft code snippet where possible (Prometheus alert YAML, Terraform resource block, Helm values change)
|
||||
- Type: Terraform/Helm/Prometheus/UptimeKuma/Runbook
|
||||
|
||||
2. **Proper UTC timeline** — all timestamps in `YYYY-MM-DDTHH:MM:SSZ` format, never relative ("47h ago")
|
||||
|
||||
3. **Recurrence analysis section** — incorporate historian's findings on past incidents and pattern matches
|
||||
|
||||
4. **Auto-severity** — use triage agent's classification with justification
|
||||
|
||||
5. **Source attribution** — every timeline event and finding must reference which agent/tool provided the evidence
|
||||
|
||||
## Workflow
|
||||
|
||||
1. **Merge timeline**: Collect all timestamped events from triage + investigation agents into a single chronological list
|
||||
2. **Identify root cause**: The earliest causal event with supporting evidence chain
|
||||
3. **Map to infra files**: Use Grep/Glob to find the exact Terraform/Helm files for affected services
|
||||
4. **Draft action items**: For each issue, create concrete actions with file paths and code snippets
|
||||
5. **Write report** to `/home/wizard/code/infra/docs/post-mortems/YYYY-MM-DD-<slug>.md`
|
||||
6. **Link to GitHub Issue**: If a GitHub Issue number was provided in the prompt:
|
||||
- Include `| **Issue** | [#N](https://github.com/ViktorBarzin/infra/issues/N) |` in the metadata table
|
||||
- After writing the report, run these commands to link the postmortem to the issue:
|
||||
```bash
|
||||
GITHUB_TOKEN=$(vault kv get -field=github_pat secret/viktor)
|
||||
# Add postmortem comment
|
||||
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" -H "Accept: application/vnd.github.v3+json" \
|
||||
"https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/comments" \
|
||||
-d "{\"body\": \"**Postmortem:** [View postmortem](https://viktorbarzin.github.io/infra/post-mortems/<slug>)\"}"
|
||||
# Add postmortem-done label, remove postmortem-required
|
||||
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" -H "Accept: application/vnd.github.v3+json" \
|
||||
"https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/labels" -d '{"labels":["postmortem-done"]}'
|
||||
curl -s -X DELETE -H "Authorization: token $GITHUB_TOKEN" \
|
||||
"https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/labels/postmortem-required"
|
||||
```
|
||||
|
||||
## NEVER Do
|
||||
|
||||
- Never run kubectl or any cluster commands — you only read files and write the report
|
||||
- Never fabricate timeline events — evidence only, with source attribution
|
||||
- Never skip the recurrence analysis section even if historian found nothing (say "First recorded incident")
|
||||
- Never use relative timestamps
|
||||
|
||||
## Report Template
|
||||
|
||||
Write the report to `docs/post-mortems/YYYY-MM-DD-<slug>.md` using this template:
|
||||
|
||||
```markdown
|
||||
# Post-Mortem: <Title>
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Date** | YYYY-MM-DD |
|
||||
| **Duration** | Xh Ym |
|
||||
| **Severity** | SEV1/SEV2/SEV3 |
|
||||
| **Classification** | Justification for severity level |
|
||||
| **Affected Services** | service1, service2 |
|
||||
| **Issue** | [#N](https://github.com/ViktorBarzin/infra/issues/N) |
|
||||
| **Status** | Draft |
|
||||
|
||||
## Summary
|
||||
|
||||
2-3 sentence overview of what happened, the impact, and the resolution.
|
||||
|
||||
## Impact
|
||||
|
||||
- **User-facing**: What users experienced
|
||||
- **Services affected**: Which services and how
|
||||
- **Duration**: How long the impact lasted
|
||||
- **Data loss**: Any data loss (or confirm none)
|
||||
|
||||
## Timeline (UTC)
|
||||
|
||||
| Time (UTC) | Event | Source |
|
||||
|------------|-------|--------|
|
||||
| YYYY-MM-DDTHH:MM:SSZ | Event description | agent-name / evidence |
|
||||
|
||||
## Root Cause
|
||||
|
||||
Technical explanation of what caused the incident, with evidence chain.
|
||||
Investigate the full causal chain — not just the symptom, but WHY the underlying condition existed.
|
||||
|
||||
## Contributing Factors
|
||||
|
||||
- Factor 1: explanation with evidence
|
||||
- Factor 2: explanation with evidence
|
||||
|
||||
## Recurrence Analysis
|
||||
|
||||
(From historian agent)
|
||||
- Previous incidents with same/similar root cause
|
||||
- Known issue matches
|
||||
- Pattern matches from architectural documentation
|
||||
- Trend analysis
|
||||
|
||||
## Detection
|
||||
|
||||
- **How detected**: Alert / user report / manual check / post-mortem scan
|
||||
- **Time to detect**: Xm from start
|
||||
- **Gap analysis**: What should have caught this earlier
|
||||
|
||||
## Resolution
|
||||
|
||||
What was done (or needs to be done) to resolve the incident.
|
||||
|
||||
## Action Items
|
||||
|
||||
### Preventive (stop recurrence)
|
||||
|
||||
| Priority | Action | File | Draft Change |
|
||||
|----------|--------|------|-------------|
|
||||
| P1 | Description | `stacks/X/main.tf:LN` | ```hcl\nresource snippet\n``` |
|
||||
|
||||
### Detective (catch faster)
|
||||
|
||||
| Priority | Action | Type | Draft Alert/Monitor |
|
||||
|----------|--------|------|-------------------|
|
||||
| P2 | Description | Prometheus/UptimeKuma | ```yaml\nalert rule\n``` |
|
||||
|
||||
### Mitigative (reduce blast radius)
|
||||
|
||||
| Priority | Action | File | Draft Change |
|
||||
|----------|--------|------|-------------|
|
||||
| P3 | Description | `stacks/X/main.tf:LN` | ```hcl\nresource snippet\n``` |
|
||||
|
||||
## Lessons Learned
|
||||
|
||||
- **Went well**: What worked during detection/response
|
||||
- **Went poorly**: What made things worse or slower
|
||||
- **Got lucky**: Things that could have made this much worse
|
||||
|
||||
## Raw Investigation Data
|
||||
|
||||
<details>
|
||||
<summary>Triage output</summary>
|
||||
|
||||
(paste triage output)
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary>Investigation agent findings</summary>
|
||||
|
||||
(paste each agent's output in separate sub-sections)
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary>Historical context</summary>
|
||||
|
||||
(paste historian output)
|
||||
|
||||
</details>
|
||||
```
|
||||
|
||||
After writing the report, output the file path so the orchestrator can inform the user.
|
||||
|
|
@ -1,58 +0,0 @@
|
|||
---
|
||||
name: sev-triage
|
||||
description: "Stage 1: Fast cluster scan and severity classification for the post-mortem pipeline. Produces structured triage output for downstream agents."
|
||||
tools: Read, Bash, Grep, Glob
|
||||
model: haiku
|
||||
---
|
||||
|
||||
You are a fast triage agent for a homelab Kubernetes cluster. Your job is to run a quick scan (~60 seconds) and produce structured output for downstream investigation agents.
|
||||
|
||||
## Environment
|
||||
|
||||
- **Kubeconfig**: `/home/wizard/code/infra/config`
|
||||
- **Infra repo**: `/home/wizard/code/infra`
|
||||
- **Context script**: `/home/wizard/code/infra/.claude/scripts/sev-context.sh`
|
||||
|
||||
## Workflow
|
||||
|
||||
1. **Run context script**: Execute `bash /home/wizard/code/infra/.claude/scripts/sev-context.sh` to get structured cluster context
|
||||
2. **Classify severity** based on findings:
|
||||
- **SEV1**: Critical path down (Traefik, Authentik, PostgreSQL, DNS, Cloudflared) OR >50% of pods unhealthy
|
||||
- **SEV2**: Partial degradation, non-critical services down, or single critical service degraded but redundant
|
||||
- **SEV3**: Minor issues, cosmetic, single non-critical pod restart
|
||||
3. **Identify affected domains** to inform which specialist agents should be spawned:
|
||||
- `storage` — NFS, PVC, CSI driver issues
|
||||
- `database` — MySQL, PostgreSQL, CNPG, replication
|
||||
- `networking` — DNS, MetalLB, CoreDNS, connectivity
|
||||
- `auth` — Authentik, TLS certs, CrowdSec
|
||||
- `compute` — Node conditions, OOM, resource pressure
|
||||
- `deploy` — Recent rollouts, image pull failures
|
||||
4. **Convert all timestamps to UTC** — never use relative times like "47h ago". Use the pod's `.status.startTime` or event `.lastTimestamp`.
|
||||
5. **Identify investigation hints** — suggest which specialist agents should be spawned based on symptoms.
|
||||
|
||||
## NEVER Do
|
||||
|
||||
- Never run `kubectl apply`, `patch`, `delete`, or any mutating commands
|
||||
- Never spend more than ~60 seconds investigating — you are a quick scan, not deep investigation
|
||||
|
||||
## Output Format
|
||||
|
||||
You MUST produce output in exactly this structured format:
|
||||
|
||||
```
|
||||
SEVERITY: SEV1|SEV2|SEV3
|
||||
AFFECTED_NAMESPACES: ns1, ns2, ns3
|
||||
AFFECTED_DOMAINS: storage, database, networking, auth, compute, deploy
|
||||
TIME_WINDOW: YYYY-MM-DDTHH:MM — YYYY-MM-DDTHH:MM (UTC)
|
||||
TRIGGER: deploy|config-change|upstream|hardware|unknown
|
||||
NODE_STATUS: node1=Ready, node2=Ready, ...
|
||||
CRITICAL_FINDINGS:
|
||||
- [YYYY-MM-DDTHH:MM:SSZ] finding 1
|
||||
- [YYYY-MM-DDTHH:MM:SSZ] finding 2
|
||||
INVESTIGATION_HINTS:
|
||||
- Suggest spawning: platform-engineer (reason)
|
||||
- Suggest spawning: dba (reason)
|
||||
- Suggest spawning: network-engineer (reason)
|
||||
```
|
||||
|
||||
Keep the output concise and machine-readable. Downstream agents will parse this.
|
||||
|
|
@ -1,509 +0,0 @@
|
|||
#!/usr/bin/env python3
|
||||
"""
|
||||
Nextcloud CalDAV Calendar Script
|
||||
Queries and creates calendar events.
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import os
|
||||
import sys
|
||||
import uuid
|
||||
from datetime import datetime, timedelta
|
||||
from urllib.parse import urljoin, unquote
|
||||
|
||||
try:
|
||||
import caldav
|
||||
from icalendar import Calendar, Event, vText
|
||||
except ImportError:
|
||||
print("ERROR: Required packages not installed. Run:")
|
||||
print(" pip install caldav icalendar")
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
def cal_name(cal):
|
||||
"""Get calendar display name, handling deprecation."""
|
||||
try:
|
||||
return unquote(cal.get_display_name() or str(cal.url).rstrip("/").split("/")[-1])
|
||||
except Exception:
|
||||
return unquote(str(cal.url).rstrip("/").split("/")[-1])
|
||||
|
||||
# Configuration from environment variables
|
||||
NEXTCLOUD_URL = os.environ.get("NEXTCLOUD_URL", "https://nextcloud.viktorbarzin.me")
|
||||
CALDAV_URL = f"{NEXTCLOUD_URL}/remote.php/dav"
|
||||
USERNAME = os.environ.get("NEXTCLOUD_USER")
|
||||
APP_PASSWORD = os.environ.get("NEXTCLOUD_APP_PASSWORD")
|
||||
|
||||
if not USERNAME or not APP_PASSWORD:
|
||||
print("ERROR: NEXTCLOUD_USER and NEXTCLOUD_APP_PASSWORD environment variables must be set.")
|
||||
print("These should be set when activating the Claude venv (~/.venvs/claude)")
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
def get_client():
|
||||
"""Create CalDAV client connection."""
|
||||
return caldav.DAVClient(
|
||||
url=CALDAV_URL,
|
||||
username=USERNAME,
|
||||
password=APP_PASSWORD
|
||||
)
|
||||
|
||||
|
||||
def list_calendars():
|
||||
"""List all available calendars."""
|
||||
client = get_client()
|
||||
principal = client.principal()
|
||||
calendars = principal.calendars()
|
||||
|
||||
result = []
|
||||
for cal in calendars:
|
||||
result.append({
|
||||
"name": cal_name(cal),
|
||||
"url": str(cal.url)
|
||||
})
|
||||
return result
|
||||
|
||||
|
||||
def get_events(calendar_name=None, start_date=None, end_date=None, days=7):
|
||||
"""Get events from calendar(s) within a date range."""
|
||||
client = get_client()
|
||||
principal = client.principal()
|
||||
calendars = principal.calendars()
|
||||
|
||||
if start_date is None:
|
||||
start_date = datetime.now().replace(hour=0, minute=0, second=0, microsecond=0)
|
||||
if end_date is None:
|
||||
end_date = start_date + timedelta(days=days)
|
||||
|
||||
all_events = []
|
||||
|
||||
for cal in calendars:
|
||||
if calendar_name and cal_name(cal).lower() != calendar_name.lower():
|
||||
continue
|
||||
|
||||
try:
|
||||
events = cal.search(start=start_date, end=end_date, event=True, expand=True)
|
||||
|
||||
for event in events:
|
||||
try:
|
||||
ical = Calendar.from_ical(event.data)
|
||||
for component in ical.walk():
|
||||
if component.name == "VEVENT":
|
||||
event_data = {
|
||||
"calendar": cal_name(cal),
|
||||
"summary": str(component.get("summary", "No title")),
|
||||
"start": None,
|
||||
"end": None,
|
||||
"location": str(component.get("location", "")) or None,
|
||||
"description": str(component.get("description", "")) or None,
|
||||
"all_day": False
|
||||
}
|
||||
|
||||
dtstart = component.get("dtstart")
|
||||
dtend = component.get("dtend")
|
||||
|
||||
if dtstart:
|
||||
dt = dtstart.dt
|
||||
if hasattr(dt, 'hour'):
|
||||
event_data["start"] = dt.strftime("%Y-%m-%d %H:%M")
|
||||
else:
|
||||
event_data["start"] = dt.strftime("%Y-%m-%d")
|
||||
event_data["all_day"] = True
|
||||
|
||||
if dtend:
|
||||
dt = dtend.dt
|
||||
if hasattr(dt, 'hour'):
|
||||
event_data["end"] = dt.strftime("%Y-%m-%d %H:%M")
|
||||
else:
|
||||
event_data["end"] = dt.strftime("%Y-%m-%d")
|
||||
|
||||
all_events.append(event_data)
|
||||
except Exception as e:
|
||||
pass # Skip malformed events
|
||||
|
||||
except Exception as e:
|
||||
print(f"Warning: Could not fetch from {cal_name(cal)}: {e}", file=sys.stderr)
|
||||
|
||||
# Sort by start date
|
||||
all_events.sort(key=lambda x: x["start"] or "")
|
||||
return all_events
|
||||
|
||||
|
||||
def create_event(summary, start_time, end_time=None, calendar_name="Personal",
|
||||
location=None, description=None, all_day=False):
|
||||
"""Create a new calendar event."""
|
||||
client = get_client()
|
||||
principal = client.principal()
|
||||
calendars = principal.calendars()
|
||||
|
||||
# Find the target calendar
|
||||
target_cal = None
|
||||
for cal in calendars:
|
||||
if cal_name(cal).lower() == calendar_name.lower():
|
||||
target_cal = cal
|
||||
break
|
||||
|
||||
if not target_cal:
|
||||
# Try partial match
|
||||
for cal in calendars:
|
||||
if calendar_name.lower() in cal_name(cal).lower():
|
||||
target_cal = cal
|
||||
break
|
||||
|
||||
if not target_cal:
|
||||
raise ValueError(f"Calendar '{calendar_name}' not found. Available: {[cal_name(c) for c in calendars]}")
|
||||
|
||||
# Create the event
|
||||
cal = Calendar()
|
||||
cal.add('prodid', '-//Claude Calendar Script//viktorbarzin.me//')
|
||||
cal.add('version', '2.0')
|
||||
|
||||
event = Event()
|
||||
event.add('summary', summary)
|
||||
event.add('uid', str(uuid.uuid4()))
|
||||
event.add('dtstamp', datetime.now())
|
||||
|
||||
if all_day:
|
||||
event.add('dtstart', start_time.date())
|
||||
if end_time:
|
||||
event.add('dtend', end_time.date())
|
||||
else:
|
||||
event.add('dtend', (start_time + timedelta(days=1)).date())
|
||||
else:
|
||||
event.add('dtstart', start_time)
|
||||
if end_time:
|
||||
event.add('dtend', end_time)
|
||||
else:
|
||||
# Default to 1 hour duration
|
||||
event.add('dtend', start_time + timedelta(hours=1))
|
||||
|
||||
if location:
|
||||
event.add('location', location)
|
||||
if description:
|
||||
event.add('description', description)
|
||||
|
||||
cal.add_component(event)
|
||||
|
||||
# Save to calendar
|
||||
target_cal.save_event(cal.to_ical().decode('utf-8'))
|
||||
|
||||
return {
|
||||
"status": "created",
|
||||
"summary": summary,
|
||||
"calendar": cal_name(target_cal),
|
||||
"start": start_time.strftime("%Y-%m-%d %H:%M") if not all_day else start_time.strftime("%Y-%m-%d"),
|
||||
"end": end_time.strftime("%Y-%m-%d %H:%M") if end_time and not all_day else None
|
||||
}
|
||||
|
||||
|
||||
def get_todos(calendar_name=None, include_completed=False):
|
||||
"""Get todos from calendar(s)."""
|
||||
client = get_client()
|
||||
principal = client.principal()
|
||||
calendars = principal.calendars()
|
||||
|
||||
all_todos = []
|
||||
|
||||
for cal in calendars:
|
||||
if calendar_name and cal_name(cal).lower() != calendar_name.lower():
|
||||
continue
|
||||
|
||||
try:
|
||||
todos = cal.todos(include_completed=include_completed)
|
||||
for todo in todos:
|
||||
try:
|
||||
ical = Calendar.from_ical(todo.data)
|
||||
for component in ical.walk():
|
||||
if component.name == "VTODO":
|
||||
due = component.get("due")
|
||||
due_str = None
|
||||
if due:
|
||||
dt = due.dt
|
||||
due_str = dt.strftime("%Y-%m-%d %H:%M") if hasattr(dt, 'hour') else dt.strftime("%Y-%m-%d")
|
||||
|
||||
priority = component.get("priority")
|
||||
all_todos.append({
|
||||
"calendar": cal_name(cal),
|
||||
"summary": str(component.get("summary", "No title")),
|
||||
"status": str(component.get("status", "NEEDS-ACTION")),
|
||||
"due": due_str,
|
||||
"priority": int(priority) if priority else None,
|
||||
"uid": str(component.get("uid", "")),
|
||||
"description": str(component.get("description", "")) or None,
|
||||
"_cal_obj": cal,
|
||||
"_todo_obj": todo,
|
||||
})
|
||||
except Exception:
|
||||
pass
|
||||
except Exception as e:
|
||||
print(f"Warning: Could not fetch todos from {cal_name(cal)}: {e}", file=sys.stderr)
|
||||
|
||||
# Sort: by due date (None last), then priority (None last), then name
|
||||
def sort_key(t):
|
||||
due = t["due"] or "9999-99-99"
|
||||
pri = t["priority"] if t["priority"] is not None else 99
|
||||
return (due, pri, t["summary"].lower())
|
||||
|
||||
all_todos.sort(key=sort_key)
|
||||
return all_todos
|
||||
|
||||
|
||||
def complete_todo(search_term, calendar_name=None):
|
||||
"""Complete a todo by searching for it by name (substring match)."""
|
||||
todos = get_todos(calendar_name=calendar_name, include_completed=False)
|
||||
search_lower = search_term.lower()
|
||||
|
||||
matches = [t for t in todos if search_lower in t["summary"].lower()]
|
||||
|
||||
if not matches:
|
||||
raise ValueError(f"No open todo matching '{search_term}' found.")
|
||||
if len(matches) > 1:
|
||||
names = [f" - [{t['calendar']}] {t['summary']}" for t in matches]
|
||||
raise ValueError(f"Multiple todos match '{search_term}':\n" + "\n".join(names) + "\nBe more specific.")
|
||||
|
||||
todo = matches[0]
|
||||
todo_obj = todo["_todo_obj"]
|
||||
todo_obj.complete()
|
||||
|
||||
return {
|
||||
"status": "completed",
|
||||
"summary": todo["summary"],
|
||||
"calendar": todo["calendar"],
|
||||
}
|
||||
|
||||
|
||||
def format_todos(todos, output_format="text"):
|
||||
"""Format todos for display."""
|
||||
if output_format == "json":
|
||||
clean = [{k: v for k, v in t.items() if not k.startswith("_")} for t in todos]
|
||||
return json.dumps(clean, indent=2)
|
||||
|
||||
if not todos:
|
||||
return "No todos found."
|
||||
|
||||
lines = []
|
||||
current_cal = None
|
||||
|
||||
for todo in todos:
|
||||
if todo["calendar"] != current_cal:
|
||||
current_cal = todo["calendar"]
|
||||
lines.append(f"\n## {current_cal}")
|
||||
|
||||
status_icon = "x" if todo["status"] == "COMPLETED" else " "
|
||||
line = f"- [{status_icon}] {todo['summary']}"
|
||||
if todo["due"]:
|
||||
line += f" (due: {todo['due']})"
|
||||
if todo["priority"] and todo["priority"] < 9:
|
||||
line += f" [priority: {todo['priority']}]"
|
||||
lines.append(line)
|
||||
|
||||
if todo["description"]:
|
||||
desc = todo["description"][:200]
|
||||
if len(todo["description"]) > 200:
|
||||
desc += "..."
|
||||
lines.append(f" {desc}")
|
||||
|
||||
return "\n".join(lines)
|
||||
|
||||
|
||||
def format_events(events, output_format="text"):
|
||||
"""Format events for display."""
|
||||
if output_format == "json":
|
||||
return json.dumps(events, indent=2)
|
||||
|
||||
if not events:
|
||||
return "No events found."
|
||||
|
||||
lines = []
|
||||
current_date = None
|
||||
|
||||
for event in events:
|
||||
event_date = event["start"][:10] if event["start"] else "Unknown"
|
||||
|
||||
if event_date != current_date:
|
||||
current_date = event_date
|
||||
try:
|
||||
dt = datetime.strptime(event_date, "%Y-%m-%d")
|
||||
lines.append(f"\n## {dt.strftime('%A, %B %d, %Y')}")
|
||||
except:
|
||||
lines.append(f"\n## {event_date}")
|
||||
|
||||
time_str = ""
|
||||
if not event["all_day"] and event["start"]:
|
||||
time_str = event["start"][11:16]
|
||||
if event["end"]:
|
||||
time_str += f" - {event['end'][11:16]}"
|
||||
else:
|
||||
time_str = "All day"
|
||||
|
||||
line = f"- **{event['summary']}** ({time_str})"
|
||||
if event["location"]:
|
||||
line += f" @ {event['location']}"
|
||||
if event["calendar"] != "personal":
|
||||
line += f" [{event['calendar']}]"
|
||||
lines.append(line)
|
||||
|
||||
if event["description"]:
|
||||
# Truncate long descriptions
|
||||
desc = event["description"][:200]
|
||||
if len(event["description"]) > 200:
|
||||
desc += "..."
|
||||
lines.append(f" {desc}")
|
||||
|
||||
return "\n".join(lines)
|
||||
|
||||
|
||||
def parse_date_arg(date_str):
|
||||
"""Parse flexible date arguments."""
|
||||
today = datetime.now().replace(hour=0, minute=0, second=0, microsecond=0)
|
||||
|
||||
if date_str == "today":
|
||||
return today, today + timedelta(days=1)
|
||||
elif date_str == "tomorrow":
|
||||
return today + timedelta(days=1), today + timedelta(days=2)
|
||||
elif date_str == "week" or date_str == "this week":
|
||||
# Start from today, go to end of week (Sunday)
|
||||
days_until_sunday = 6 - today.weekday()
|
||||
return today, today + timedelta(days=days_until_sunday + 1)
|
||||
elif date_str == "next week":
|
||||
days_until_next_monday = 7 - today.weekday()
|
||||
start = today + timedelta(days=days_until_next_monday)
|
||||
return start, start + timedelta(days=7)
|
||||
elif date_str == "month" or date_str == "this month":
|
||||
return today, today + timedelta(days=30)
|
||||
else:
|
||||
# Try to parse as a date
|
||||
try:
|
||||
dt = datetime.strptime(date_str, "%Y-%m-%d")
|
||||
return dt, dt + timedelta(days=1)
|
||||
except:
|
||||
return today, today + timedelta(days=7)
|
||||
|
||||
|
||||
def parse_datetime(dt_str):
|
||||
"""Parse flexible datetime strings."""
|
||||
today = datetime.now().replace(hour=0, minute=0, second=0, microsecond=0)
|
||||
|
||||
# Handle relative dates with time
|
||||
if dt_str.startswith("today "):
|
||||
time_part = dt_str.replace("today ", "")
|
||||
try:
|
||||
t = datetime.strptime(time_part, "%H:%M")
|
||||
return today.replace(hour=t.hour, minute=t.minute)
|
||||
except:
|
||||
pass
|
||||
|
||||
if dt_str.startswith("tomorrow "):
|
||||
time_part = dt_str.replace("tomorrow ", "")
|
||||
try:
|
||||
t = datetime.strptime(time_part, "%H:%M")
|
||||
return (today + timedelta(days=1)).replace(hour=t.hour, minute=t.minute)
|
||||
except:
|
||||
pass
|
||||
|
||||
# Try full datetime format
|
||||
for fmt in ["%Y-%m-%d %H:%M", "%Y-%m-%d %H:%M:%S", "%Y-%m-%dT%H:%M", "%Y-%m-%dT%H:%M:%S"]:
|
||||
try:
|
||||
return datetime.strptime(dt_str, fmt)
|
||||
except:
|
||||
continue
|
||||
|
||||
# Try date only
|
||||
try:
|
||||
return datetime.strptime(dt_str, "%Y-%m-%d")
|
||||
except:
|
||||
pass
|
||||
|
||||
raise ValueError(f"Could not parse datetime: {dt_str}. Use 'YYYY-MM-DD HH:MM' or 'tomorrow HH:MM'")
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description="Query and manage Nextcloud Calendar")
|
||||
parser.add_argument("command", choices=["list", "events", "today", "tomorrow", "week", "month", "create"],
|
||||
help="Command to run")
|
||||
parser.add_argument("--calendar", "-c", default=None, help="Calendar name filter (default: all calendars)")
|
||||
parser.add_argument("--days", "-d", type=int, default=7, help="Number of days to fetch")
|
||||
parser.add_argument("--json", action="store_true", help="Output as JSON")
|
||||
parser.add_argument("--date", help="Specific date (YYYY-MM-DD) or relative (today, tomorrow, week, month)")
|
||||
# Create event options
|
||||
parser.add_argument("--title", "-t", help="Event title (for create)")
|
||||
parser.add_argument("--start", "-s", help="Start time: 'YYYY-MM-DD HH:MM' or 'tomorrow 10:00'")
|
||||
parser.add_argument("--end", "-e", help="End time: 'YYYY-MM-DD HH:MM' (optional, defaults to +1 hour)")
|
||||
parser.add_argument("--location", "-l", help="Event location")
|
||||
parser.add_argument("--description", help="Event description")
|
||||
parser.add_argument("--all-day", action="store_true", help="Create all-day event")
|
||||
|
||||
args = parser.parse_args()
|
||||
output_format = "json" if args.json else "text"
|
||||
|
||||
try:
|
||||
if args.command == "list":
|
||||
calendars = list_calendars()
|
||||
if output_format == "json":
|
||||
print(json.dumps(calendars, indent=2))
|
||||
else:
|
||||
print("Available calendars:")
|
||||
for cal in calendars:
|
||||
print(f" - {cal['name']}")
|
||||
|
||||
elif args.command == "events":
|
||||
if args.date:
|
||||
start, end = parse_date_arg(args.date)
|
||||
else:
|
||||
start = datetime.now().replace(hour=0, minute=0, second=0, microsecond=0)
|
||||
end = start + timedelta(days=args.days)
|
||||
|
||||
events = get_events(
|
||||
calendar_name=args.calendar,
|
||||
start_date=start,
|
||||
end_date=end
|
||||
)
|
||||
print(format_events(events, output_format))
|
||||
|
||||
elif args.command in ["today", "tomorrow", "week", "month"]:
|
||||
start, end = parse_date_arg(args.command)
|
||||
events = get_events(
|
||||
calendar_name=args.calendar,
|
||||
start_date=start,
|
||||
end_date=end
|
||||
)
|
||||
print(format_events(events, output_format))
|
||||
|
||||
elif args.command == "create":
|
||||
if not args.title:
|
||||
print("ERROR: --title is required for create command", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
if not args.start:
|
||||
print("ERROR: --start is required for create command", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
# Parse start time
|
||||
start_time = parse_datetime(args.start)
|
||||
end_time = parse_datetime(args.end) if args.end else None
|
||||
|
||||
result = create_event(
|
||||
summary=args.title,
|
||||
start_time=start_time,
|
||||
end_time=end_time,
|
||||
calendar_name=args.calendar,
|
||||
location=args.location,
|
||||
description=args.description,
|
||||
all_day=args.all_day
|
||||
)
|
||||
|
||||
if output_format == "json":
|
||||
print(json.dumps(result, indent=2))
|
||||
else:
|
||||
print(f"Event created: {result['summary']}")
|
||||
print(f" Calendar: {result['calendar']}")
|
||||
print(f" Start: {result['start']}")
|
||||
if result['end']:
|
||||
print(f" End: {result['end']}")
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error: {e}", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
|
|
@ -1,16 +0,0 @@
|
|||
# Add New Service
|
||||
|
||||
Help create a new Kubernetes service module.
|
||||
|
||||
Service name: $ARGUMENTS
|
||||
|
||||
Steps:
|
||||
1. Create directory at modules/kubernetes/<service-name>/
|
||||
2. Create main.tf with:
|
||||
- Namespace resource
|
||||
- Deployment with appropriate container
|
||||
- Service resource
|
||||
- Ingress with TLS and standard annotations
|
||||
3. Use existing patterns from similar services
|
||||
4. Add module reference in main.tf
|
||||
5. Update .claude/CLAUDE.md with new service version
|
||||
|
|
@ -1,13 +0,0 @@
|
|||
# Kubectl Command
|
||||
|
||||
Run kubectl commands on the cluster.
|
||||
|
||||
```bash
|
||||
kubectl --kubeconfig $(pwd)/config $ARGUMENTS
|
||||
```
|
||||
|
||||
Examples:
|
||||
- `/kubectl get pods -A` - List all pods
|
||||
- `/kubectl get pods -n immich` - List pods in immich namespace
|
||||
- `/kubectl logs -n immich deploy/immich-server` - View logs
|
||||
- `/kubectl describe pod -n monitoring <pod>` - Describe a pod
|
||||
|
|
@ -1,9 +0,0 @@
|
|||
# List All Services
|
||||
|
||||
List all Kubernetes services deployed in this infrastructure.
|
||||
|
||||
```bash
|
||||
ls -1 modules/kubernetes/
|
||||
```
|
||||
|
||||
Provide a summary of the services, grouped by category if possible (media, monitoring, productivity, etc.).
|
||||
|
|
@ -1,10 +0,0 @@
|
|||
# Check Service Version
|
||||
|
||||
Find the version of a specific service deployed in this infrastructure.
|
||||
|
||||
Search for the service name in modules/kubernetes/ and extract:
|
||||
1. The image version/tag being used
|
||||
2. Any version variables defined
|
||||
3. The Helm chart version if applicable
|
||||
|
||||
Service to check: $ARGUMENTS
|
||||
|
|
@ -1,9 +0,0 @@
|
|||
# Terraform Apply
|
||||
|
||||
Run terraform apply to deploy infrastructure changes.
|
||||
|
||||
```bash
|
||||
terraform apply -target=module.kubernetes_cluster.module.<service> -var="kube_config_path=$(pwd)/config" -auto-approve
|
||||
```
|
||||
|
||||
ALWAYS use -target to speed up execution. Monitor the output and report any errors or successful completions.
|
||||
|
|
@ -1,9 +0,0 @@
|
|||
# Terraform Plan
|
||||
|
||||
Run terraform plan to preview infrastructure changes.
|
||||
|
||||
```bash
|
||||
terraform plan -target=module.kubernetes_cluster.module.<service> -var="kube_config_path=$(pwd)/config"
|
||||
```
|
||||
|
||||
ALWAYS use -target to speed up execution. Summarize the planned changes, highlighting any resources being destroyed or recreated.
|
||||
|
|
@ -1,12 +0,0 @@
|
|||
# Update Knowledge Base
|
||||
|
||||
Update the .claude/CLAUDE.md knowledge file with new learnings.
|
||||
|
||||
Add or update information based on recent discoveries about:
|
||||
- Service versions
|
||||
- Infrastructure patterns
|
||||
- Important configurations
|
||||
- Useful commands
|
||||
- Troubleshooting notes
|
||||
|
||||
Context to add: $ARGUMENTS
|
||||
|
|
@ -1,373 +0,0 @@
|
|||
#!/usr/bin/env python3
|
||||
"""
|
||||
Home Assistant API Script (ha-sofia instance)
|
||||
Control and query Home Assistant entities on ha-sofia.viktorbarzin.me.
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import os
|
||||
import sys
|
||||
from urllib.parse import urljoin
|
||||
|
||||
try:
|
||||
import requests
|
||||
except ImportError:
|
||||
print("ERROR: Required package not installed. Run:")
|
||||
print(" pip install requests")
|
||||
sys.exit(1)
|
||||
|
||||
# Configuration from environment variables (ha-sofia specific)
|
||||
HA_URL = os.environ.get("HOME_ASSISTANT_SOFIA_URL", "").rstrip("/")
|
||||
HA_TOKEN = os.environ.get("HOME_ASSISTANT_SOFIA_TOKEN")
|
||||
|
||||
if not HA_URL or not HA_TOKEN:
|
||||
print("ERROR: HOME_ASSISTANT_SOFIA_URL and HOME_ASSISTANT_SOFIA_TOKEN environment variables must be set.")
|
||||
print("These should be set when activating the Claude venv (~/.venvs/claude)")
|
||||
sys.exit(1)
|
||||
|
||||
HEADERS = {
|
||||
"Authorization": f"Bearer {HA_TOKEN}",
|
||||
"Content-Type": "application/json",
|
||||
}
|
||||
|
||||
|
||||
def api_get(endpoint):
|
||||
"""Make GET request to HA API."""
|
||||
url = f"{HA_URL}/api/{endpoint}"
|
||||
response = requests.get(url, headers=HEADERS, timeout=30)
|
||||
response.raise_for_status()
|
||||
return response.json()
|
||||
|
||||
|
||||
def api_post(endpoint, data=None):
|
||||
"""Make POST request to HA API."""
|
||||
url = f"{HA_URL}/api/{endpoint}"
|
||||
response = requests.post(url, headers=HEADERS, json=data or {}, timeout=30)
|
||||
response.raise_for_status()
|
||||
return response.json() if response.text else {}
|
||||
|
||||
|
||||
def get_states():
|
||||
"""Get all entity states."""
|
||||
return api_get("states")
|
||||
|
||||
|
||||
def get_state(entity_id):
|
||||
"""Get state of a specific entity."""
|
||||
return api_get(f"states/{entity_id}")
|
||||
|
||||
|
||||
def get_services():
|
||||
"""Get all available services."""
|
||||
return api_get("services")
|
||||
|
||||
|
||||
def call_service(domain, service, entity_id=None, data=None):
|
||||
"""Call a Home Assistant service."""
|
||||
payload = data or {}
|
||||
if entity_id:
|
||||
payload["entity_id"] = entity_id
|
||||
return api_post(f"services/{domain}/{service}", payload)
|
||||
|
||||
|
||||
def list_entities(domain_filter=None, area_filter=None):
|
||||
"""List all entities, optionally filtered by domain or area."""
|
||||
states = get_states()
|
||||
entities = []
|
||||
|
||||
for state in states:
|
||||
entity_id = state["entity_id"]
|
||||
domain = entity_id.split(".")[0]
|
||||
|
||||
if domain_filter and domain != domain_filter:
|
||||
continue
|
||||
|
||||
entities.append({
|
||||
"entity_id": entity_id,
|
||||
"state": state["state"],
|
||||
"friendly_name": state["attributes"].get("friendly_name", entity_id),
|
||||
"domain": domain,
|
||||
})
|
||||
|
||||
# Sort by domain, then entity_id
|
||||
entities.sort(key=lambda x: (x["domain"], x["entity_id"]))
|
||||
return entities
|
||||
|
||||
|
||||
def turn_on(entity_id):
|
||||
"""Turn on an entity."""
|
||||
domain = entity_id.split(".")[0]
|
||||
return call_service(domain, "turn_on", entity_id)
|
||||
|
||||
|
||||
def turn_off(entity_id):
|
||||
"""Turn off an entity."""
|
||||
domain = entity_id.split(".")[0]
|
||||
return call_service(domain, "turn_off", entity_id)
|
||||
|
||||
|
||||
def toggle(entity_id):
|
||||
"""Toggle an entity."""
|
||||
domain = entity_id.split(".")[0]
|
||||
return call_service(domain, "toggle", entity_id)
|
||||
|
||||
|
||||
def set_value(entity_id, value):
|
||||
"""Set value for input entities (input_number, input_text, etc.)."""
|
||||
domain = entity_id.split(".")[0]
|
||||
|
||||
if domain == "input_number":
|
||||
return call_service(domain, "set_value", entity_id, {"value": float(value)})
|
||||
elif domain == "input_text":
|
||||
return call_service(domain, "set_value", entity_id, {"value": str(value)})
|
||||
elif domain == "input_boolean":
|
||||
if value.lower() in ("true", "on", "1", "yes"):
|
||||
return turn_on(entity_id)
|
||||
else:
|
||||
return turn_off(entity_id)
|
||||
elif domain == "input_select":
|
||||
return call_service(domain, "select_option", entity_id, {"option": str(value)})
|
||||
elif domain == "light":
|
||||
# Assume value is brightness percentage
|
||||
return call_service(domain, "turn_on", entity_id, {"brightness_pct": int(value)})
|
||||
elif domain == "climate":
|
||||
return call_service(domain, "set_temperature", entity_id, {"temperature": float(value)})
|
||||
elif domain == "cover":
|
||||
return call_service(domain, "set_cover_position", entity_id, {"position": int(value)})
|
||||
else:
|
||||
print(f"Warning: set_value not implemented for domain '{domain}'", file=sys.stderr)
|
||||
return {}
|
||||
|
||||
|
||||
def run_script(script_id):
|
||||
"""Run a script."""
|
||||
if not script_id.startswith("script."):
|
||||
script_id = f"script.{script_id}"
|
||||
return call_service("script", "turn_on", script_id)
|
||||
|
||||
|
||||
def run_scene(scene_id):
|
||||
"""Activate a scene."""
|
||||
if not scene_id.startswith("scene."):
|
||||
scene_id = f"scene.{scene_id}"
|
||||
return call_service("scene", "turn_on", scene_id)
|
||||
|
||||
|
||||
def send_notification(message, title=None, target="notify"):
|
||||
"""Send a notification."""
|
||||
data = {"message": message}
|
||||
if title:
|
||||
data["title"] = title
|
||||
return call_service("notify", target, data=data)
|
||||
|
||||
|
||||
def format_entities(entities, output_format="text"):
|
||||
"""Format entities for display."""
|
||||
if output_format == "json":
|
||||
return json.dumps(entities, indent=2)
|
||||
|
||||
if not entities:
|
||||
return "No entities found."
|
||||
|
||||
lines = []
|
||||
current_domain = None
|
||||
|
||||
for entity in entities:
|
||||
if entity["domain"] != current_domain:
|
||||
current_domain = entity["domain"]
|
||||
lines.append(f"\n## {current_domain}")
|
||||
|
||||
state = entity["state"]
|
||||
name = entity["friendly_name"]
|
||||
eid = entity["entity_id"]
|
||||
|
||||
# Color-code common states
|
||||
if state in ("on", "home", "open", "playing"):
|
||||
state_display = f"[ON] {state}"
|
||||
elif state in ("off", "away", "closed", "idle", "paused"):
|
||||
state_display = f"[--] {state}"
|
||||
elif state == "unavailable":
|
||||
state_display = "[??] unavailable"
|
||||
else:
|
||||
state_display = state
|
||||
|
||||
lines.append(f"- {name}: {state_display}")
|
||||
lines.append(f" `{eid}`")
|
||||
|
||||
return "\n".join(lines)
|
||||
|
||||
|
||||
def search_entities(query):
|
||||
"""Search entities by name or ID."""
|
||||
query = query.lower()
|
||||
states = get_states()
|
||||
matches = []
|
||||
|
||||
for state in states:
|
||||
entity_id = state["entity_id"]
|
||||
friendly_name = state["attributes"].get("friendly_name", "").lower()
|
||||
|
||||
if query in entity_id.lower() or query in friendly_name:
|
||||
matches.append({
|
||||
"entity_id": entity_id,
|
||||
"state": state["state"],
|
||||
"friendly_name": state["attributes"].get("friendly_name", entity_id),
|
||||
"domain": entity_id.split(".")[0],
|
||||
})
|
||||
|
||||
matches.sort(key=lambda x: (x["domain"], x["entity_id"]))
|
||||
return matches
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description="Control Home Assistant (ha-sofia)")
|
||||
subparsers = parser.add_subparsers(dest="command", help="Command to run")
|
||||
|
||||
# List command
|
||||
list_parser = subparsers.add_parser("list", help="List entities")
|
||||
list_parser.add_argument("--domain", "-d", help="Filter by domain (light, switch, sensor, etc.)")
|
||||
list_parser.add_argument("--json", action="store_true", help="Output as JSON")
|
||||
|
||||
# Search command
|
||||
search_parser = subparsers.add_parser("search", help="Search entities")
|
||||
search_parser.add_argument("query", help="Search query")
|
||||
search_parser.add_argument("--json", action="store_true", help="Output as JSON")
|
||||
|
||||
# State command
|
||||
state_parser = subparsers.add_parser("state", help="Get entity state")
|
||||
state_parser.add_argument("entity_id", help="Entity ID")
|
||||
state_parser.add_argument("--json", action="store_true", help="Output as JSON")
|
||||
|
||||
# On command
|
||||
on_parser = subparsers.add_parser("on", help="Turn on entity")
|
||||
on_parser.add_argument("entity_id", help="Entity ID")
|
||||
|
||||
# Off command
|
||||
off_parser = subparsers.add_parser("off", help="Turn off entity")
|
||||
off_parser.add_argument("entity_id", help="Entity ID")
|
||||
|
||||
# Toggle command
|
||||
toggle_parser = subparsers.add_parser("toggle", help="Toggle entity")
|
||||
toggle_parser.add_argument("entity_id", help="Entity ID")
|
||||
|
||||
# Set command
|
||||
set_parser = subparsers.add_parser("set", help="Set entity value")
|
||||
set_parser.add_argument("entity_id", help="Entity ID")
|
||||
set_parser.add_argument("value", help="Value to set")
|
||||
|
||||
# Script command
|
||||
script_parser = subparsers.add_parser("script", help="Run a script")
|
||||
script_parser.add_argument("script_id", help="Script ID (with or without 'script.' prefix)")
|
||||
|
||||
# Scene command
|
||||
scene_parser = subparsers.add_parser("scene", help="Activate a scene")
|
||||
scene_parser.add_argument("scene_id", help="Scene ID (with or without 'scene.' prefix)")
|
||||
|
||||
# Service command
|
||||
service_parser = subparsers.add_parser("service", help="Call a service")
|
||||
service_parser.add_argument("domain", help="Service domain")
|
||||
service_parser.add_argument("service", help="Service name")
|
||||
service_parser.add_argument("--entity", "-e", help="Entity ID")
|
||||
service_parser.add_argument("--data", "-d", help="JSON data")
|
||||
|
||||
# Services list command
|
||||
services_parser = subparsers.add_parser("services", help="List available services")
|
||||
services_parser.add_argument("--domain", "-d", help="Filter by domain")
|
||||
services_parser.add_argument("--json", action="store_true", help="Output as JSON")
|
||||
|
||||
# Notify command
|
||||
notify_parser = subparsers.add_parser("notify", help="Send notification")
|
||||
notify_parser.add_argument("message", help="Notification message")
|
||||
notify_parser.add_argument("--title", "-t", help="Notification title")
|
||||
notify_parser.add_argument("--target", default="notify", help="Notification target (default: notify)")
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
if not args.command:
|
||||
parser.print_help()
|
||||
sys.exit(1)
|
||||
|
||||
try:
|
||||
if args.command == "list":
|
||||
entities = list_entities(domain_filter=args.domain)
|
||||
output_format = "json" if args.json else "text"
|
||||
print(format_entities(entities, output_format))
|
||||
|
||||
elif args.command == "search":
|
||||
entities = search_entities(args.query)
|
||||
output_format = "json" if args.json else "text"
|
||||
print(format_entities(entities, output_format))
|
||||
|
||||
elif args.command == "state":
|
||||
state = get_state(args.entity_id)
|
||||
if args.json:
|
||||
print(json.dumps(state, indent=2))
|
||||
else:
|
||||
print(f"Entity: {state['entity_id']}")
|
||||
print(f"State: {state['state']}")
|
||||
print(f"Name: {state['attributes'].get('friendly_name', 'N/A')}")
|
||||
if state['attributes']:
|
||||
print("Attributes:")
|
||||
for key, value in state['attributes'].items():
|
||||
if key != 'friendly_name':
|
||||
print(f" {key}: {value}")
|
||||
|
||||
elif args.command == "on":
|
||||
turn_on(args.entity_id)
|
||||
print(f"Turned on: {args.entity_id}")
|
||||
|
||||
elif args.command == "off":
|
||||
turn_off(args.entity_id)
|
||||
print(f"Turned off: {args.entity_id}")
|
||||
|
||||
elif args.command == "toggle":
|
||||
toggle(args.entity_id)
|
||||
print(f"Toggled: {args.entity_id}")
|
||||
|
||||
elif args.command == "set":
|
||||
set_value(args.entity_id, args.value)
|
||||
print(f"Set {args.entity_id} to {args.value}")
|
||||
|
||||
elif args.command == "script":
|
||||
run_script(args.script_id)
|
||||
print(f"Ran script: {args.script_id}")
|
||||
|
||||
elif args.command == "scene":
|
||||
run_scene(args.scene_id)
|
||||
print(f"Activated scene: {args.scene_id}")
|
||||
|
||||
elif args.command == "service":
|
||||
data = json.loads(args.data) if args.data else None
|
||||
call_service(args.domain, args.service, args.entity, data)
|
||||
print(f"Called {args.domain}.{args.service}")
|
||||
|
||||
elif args.command == "services":
|
||||
services = get_services()
|
||||
if args.domain:
|
||||
services = [s for s in services if s["domain"] == args.domain]
|
||||
|
||||
if args.json:
|
||||
print(json.dumps(services, indent=2))
|
||||
else:
|
||||
for svc in services:
|
||||
print(f"\n## {svc['domain']}")
|
||||
for name, info in svc["services"].items():
|
||||
desc = info.get("description", "")
|
||||
print(f"- {name}: {desc[:60]}...")
|
||||
|
||||
elif args.command == "notify":
|
||||
send_notification(args.message, args.title, args.target)
|
||||
print(f"Sent notification: {args.message[:50]}...")
|
||||
|
||||
except requests.exceptions.HTTPError as e:
|
||||
print(f"HTTP Error: {e}", file=sys.stderr)
|
||||
print(f"Response: {e.response.text}", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
except Exception as e:
|
||||
print(f"Error: {e}", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
|
|
@ -1,373 +0,0 @@
|
|||
#!/usr/bin/env python3
|
||||
"""
|
||||
Home Assistant API Script
|
||||
Control and query Home Assistant entities.
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import os
|
||||
import sys
|
||||
from urllib.parse import urljoin
|
||||
|
||||
try:
|
||||
import requests
|
||||
except ImportError:
|
||||
print("ERROR: Required package not installed. Run:")
|
||||
print(" pip install requests")
|
||||
sys.exit(1)
|
||||
|
||||
# Configuration from environment variables
|
||||
HA_URL = os.environ.get("HOME_ASSISTANT_URL", "").rstrip("/")
|
||||
HA_TOKEN = os.environ.get("HOME_ASSISTANT_TOKEN")
|
||||
|
||||
if not HA_URL or not HA_TOKEN:
|
||||
print("ERROR: HOME_ASSISTANT_URL and HOME_ASSISTANT_TOKEN environment variables must be set.")
|
||||
print("These should be set when activating the Claude venv (~/.venvs/claude)")
|
||||
sys.exit(1)
|
||||
|
||||
HEADERS = {
|
||||
"Authorization": f"Bearer {HA_TOKEN}",
|
||||
"Content-Type": "application/json",
|
||||
}
|
||||
|
||||
|
||||
def api_get(endpoint):
|
||||
"""Make GET request to HA API."""
|
||||
url = f"{HA_URL}/api/{endpoint}"
|
||||
response = requests.get(url, headers=HEADERS, timeout=30)
|
||||
response.raise_for_status()
|
||||
return response.json()
|
||||
|
||||
|
||||
def api_post(endpoint, data=None):
|
||||
"""Make POST request to HA API."""
|
||||
url = f"{HA_URL}/api/{endpoint}"
|
||||
response = requests.post(url, headers=HEADERS, json=data or {}, timeout=30)
|
||||
response.raise_for_status()
|
||||
return response.json() if response.text else {}
|
||||
|
||||
|
||||
def get_states():
|
||||
"""Get all entity states."""
|
||||
return api_get("states")
|
||||
|
||||
|
||||
def get_state(entity_id):
|
||||
"""Get state of a specific entity."""
|
||||
return api_get(f"states/{entity_id}")
|
||||
|
||||
|
||||
def get_services():
|
||||
"""Get all available services."""
|
||||
return api_get("services")
|
||||
|
||||
|
||||
def call_service(domain, service, entity_id=None, data=None):
|
||||
"""Call a Home Assistant service."""
|
||||
payload = data or {}
|
||||
if entity_id:
|
||||
payload["entity_id"] = entity_id
|
||||
return api_post(f"services/{domain}/{service}", payload)
|
||||
|
||||
|
||||
def list_entities(domain_filter=None, area_filter=None):
|
||||
"""List all entities, optionally filtered by domain or area."""
|
||||
states = get_states()
|
||||
entities = []
|
||||
|
||||
for state in states:
|
||||
entity_id = state["entity_id"]
|
||||
domain = entity_id.split(".")[0]
|
||||
|
||||
if domain_filter and domain != domain_filter:
|
||||
continue
|
||||
|
||||
entities.append({
|
||||
"entity_id": entity_id,
|
||||
"state": state["state"],
|
||||
"friendly_name": state["attributes"].get("friendly_name", entity_id),
|
||||
"domain": domain,
|
||||
})
|
||||
|
||||
# Sort by domain, then entity_id
|
||||
entities.sort(key=lambda x: (x["domain"], x["entity_id"]))
|
||||
return entities
|
||||
|
||||
|
||||
def turn_on(entity_id):
|
||||
"""Turn on an entity."""
|
||||
domain = entity_id.split(".")[0]
|
||||
return call_service(domain, "turn_on", entity_id)
|
||||
|
||||
|
||||
def turn_off(entity_id):
|
||||
"""Turn off an entity."""
|
||||
domain = entity_id.split(".")[0]
|
||||
return call_service(domain, "turn_off", entity_id)
|
||||
|
||||
|
||||
def toggle(entity_id):
|
||||
"""Toggle an entity."""
|
||||
domain = entity_id.split(".")[0]
|
||||
return call_service(domain, "toggle", entity_id)
|
||||
|
||||
|
||||
def set_value(entity_id, value):
|
||||
"""Set value for input entities (input_number, input_text, etc.)."""
|
||||
domain = entity_id.split(".")[0]
|
||||
|
||||
if domain == "input_number":
|
||||
return call_service(domain, "set_value", entity_id, {"value": float(value)})
|
||||
elif domain == "input_text":
|
||||
return call_service(domain, "set_value", entity_id, {"value": str(value)})
|
||||
elif domain == "input_boolean":
|
||||
if value.lower() in ("true", "on", "1", "yes"):
|
||||
return turn_on(entity_id)
|
||||
else:
|
||||
return turn_off(entity_id)
|
||||
elif domain == "input_select":
|
||||
return call_service(domain, "select_option", entity_id, {"option": str(value)})
|
||||
elif domain == "light":
|
||||
# Assume value is brightness percentage
|
||||
return call_service(domain, "turn_on", entity_id, {"brightness_pct": int(value)})
|
||||
elif domain == "climate":
|
||||
return call_service(domain, "set_temperature", entity_id, {"temperature": float(value)})
|
||||
elif domain == "cover":
|
||||
return call_service(domain, "set_cover_position", entity_id, {"position": int(value)})
|
||||
else:
|
||||
print(f"Warning: set_value not implemented for domain '{domain}'", file=sys.stderr)
|
||||
return {}
|
||||
|
||||
|
||||
def run_script(script_id):
|
||||
"""Run a script."""
|
||||
if not script_id.startswith("script."):
|
||||
script_id = f"script.{script_id}"
|
||||
return call_service("script", "turn_on", script_id)
|
||||
|
||||
|
||||
def run_scene(scene_id):
|
||||
"""Activate a scene."""
|
||||
if not scene_id.startswith("scene."):
|
||||
scene_id = f"scene.{scene_id}"
|
||||
return call_service("scene", "turn_on", scene_id)
|
||||
|
||||
|
||||
def send_notification(message, title=None, target="notify"):
|
||||
"""Send a notification."""
|
||||
data = {"message": message}
|
||||
if title:
|
||||
data["title"] = title
|
||||
return call_service("notify", target, data=data)
|
||||
|
||||
|
||||
def format_entities(entities, output_format="text"):
|
||||
"""Format entities for display."""
|
||||
if output_format == "json":
|
||||
return json.dumps(entities, indent=2)
|
||||
|
||||
if not entities:
|
||||
return "No entities found."
|
||||
|
||||
lines = []
|
||||
current_domain = None
|
||||
|
||||
for entity in entities:
|
||||
if entity["domain"] != current_domain:
|
||||
current_domain = entity["domain"]
|
||||
lines.append(f"\n## {current_domain}")
|
||||
|
||||
state = entity["state"]
|
||||
name = entity["friendly_name"]
|
||||
eid = entity["entity_id"]
|
||||
|
||||
# Color-code common states
|
||||
if state in ("on", "home", "open", "playing"):
|
||||
state_display = f"[ON] {state}"
|
||||
elif state in ("off", "away", "closed", "idle", "paused"):
|
||||
state_display = f"[--] {state}"
|
||||
elif state == "unavailable":
|
||||
state_display = "[??] unavailable"
|
||||
else:
|
||||
state_display = state
|
||||
|
||||
lines.append(f"- {name}: {state_display}")
|
||||
lines.append(f" `{eid}`")
|
||||
|
||||
return "\n".join(lines)
|
||||
|
||||
|
||||
def search_entities(query):
|
||||
"""Search entities by name or ID."""
|
||||
query = query.lower()
|
||||
states = get_states()
|
||||
matches = []
|
||||
|
||||
for state in states:
|
||||
entity_id = state["entity_id"]
|
||||
friendly_name = state["attributes"].get("friendly_name", "").lower()
|
||||
|
||||
if query in entity_id.lower() or query in friendly_name:
|
||||
matches.append({
|
||||
"entity_id": entity_id,
|
||||
"state": state["state"],
|
||||
"friendly_name": state["attributes"].get("friendly_name", entity_id),
|
||||
"domain": entity_id.split(".")[0],
|
||||
})
|
||||
|
||||
matches.sort(key=lambda x: (x["domain"], x["entity_id"]))
|
||||
return matches
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description="Control Home Assistant")
|
||||
subparsers = parser.add_subparsers(dest="command", help="Command to run")
|
||||
|
||||
# List command
|
||||
list_parser = subparsers.add_parser("list", help="List entities")
|
||||
list_parser.add_argument("--domain", "-d", help="Filter by domain (light, switch, sensor, etc.)")
|
||||
list_parser.add_argument("--json", action="store_true", help="Output as JSON")
|
||||
|
||||
# Search command
|
||||
search_parser = subparsers.add_parser("search", help="Search entities")
|
||||
search_parser.add_argument("query", help="Search query")
|
||||
search_parser.add_argument("--json", action="store_true", help="Output as JSON")
|
||||
|
||||
# State command
|
||||
state_parser = subparsers.add_parser("state", help="Get entity state")
|
||||
state_parser.add_argument("entity_id", help="Entity ID")
|
||||
state_parser.add_argument("--json", action="store_true", help="Output as JSON")
|
||||
|
||||
# On command
|
||||
on_parser = subparsers.add_parser("on", help="Turn on entity")
|
||||
on_parser.add_argument("entity_id", help="Entity ID")
|
||||
|
||||
# Off command
|
||||
off_parser = subparsers.add_parser("off", help="Turn off entity")
|
||||
off_parser.add_argument("entity_id", help="Entity ID")
|
||||
|
||||
# Toggle command
|
||||
toggle_parser = subparsers.add_parser("toggle", help="Toggle entity")
|
||||
toggle_parser.add_argument("entity_id", help="Entity ID")
|
||||
|
||||
# Set command
|
||||
set_parser = subparsers.add_parser("set", help="Set entity value")
|
||||
set_parser.add_argument("entity_id", help="Entity ID")
|
||||
set_parser.add_argument("value", help="Value to set")
|
||||
|
||||
# Script command
|
||||
script_parser = subparsers.add_parser("script", help="Run a script")
|
||||
script_parser.add_argument("script_id", help="Script ID (with or without 'script.' prefix)")
|
||||
|
||||
# Scene command
|
||||
scene_parser = subparsers.add_parser("scene", help="Activate a scene")
|
||||
scene_parser.add_argument("scene_id", help="Scene ID (with or without 'scene.' prefix)")
|
||||
|
||||
# Service command
|
||||
service_parser = subparsers.add_parser("service", help="Call a service")
|
||||
service_parser.add_argument("domain", help="Service domain")
|
||||
service_parser.add_argument("service", help="Service name")
|
||||
service_parser.add_argument("--entity", "-e", help="Entity ID")
|
||||
service_parser.add_argument("--data", "-d", help="JSON data")
|
||||
|
||||
# Services list command
|
||||
services_parser = subparsers.add_parser("services", help="List available services")
|
||||
services_parser.add_argument("--domain", "-d", help="Filter by domain")
|
||||
services_parser.add_argument("--json", action="store_true", help="Output as JSON")
|
||||
|
||||
# Notify command
|
||||
notify_parser = subparsers.add_parser("notify", help="Send notification")
|
||||
notify_parser.add_argument("message", help="Notification message")
|
||||
notify_parser.add_argument("--title", "-t", help="Notification title")
|
||||
notify_parser.add_argument("--target", default="notify", help="Notification target (default: notify)")
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
if not args.command:
|
||||
parser.print_help()
|
||||
sys.exit(1)
|
||||
|
||||
try:
|
||||
if args.command == "list":
|
||||
entities = list_entities(domain_filter=args.domain)
|
||||
output_format = "json" if args.json else "text"
|
||||
print(format_entities(entities, output_format))
|
||||
|
||||
elif args.command == "search":
|
||||
entities = search_entities(args.query)
|
||||
output_format = "json" if args.json else "text"
|
||||
print(format_entities(entities, output_format))
|
||||
|
||||
elif args.command == "state":
|
||||
state = get_state(args.entity_id)
|
||||
if args.json:
|
||||
print(json.dumps(state, indent=2))
|
||||
else:
|
||||
print(f"Entity: {state['entity_id']}")
|
||||
print(f"State: {state['state']}")
|
||||
print(f"Name: {state['attributes'].get('friendly_name', 'N/A')}")
|
||||
if state['attributes']:
|
||||
print("Attributes:")
|
||||
for key, value in state['attributes'].items():
|
||||
if key != 'friendly_name':
|
||||
print(f" {key}: {value}")
|
||||
|
||||
elif args.command == "on":
|
||||
turn_on(args.entity_id)
|
||||
print(f"Turned on: {args.entity_id}")
|
||||
|
||||
elif args.command == "off":
|
||||
turn_off(args.entity_id)
|
||||
print(f"Turned off: {args.entity_id}")
|
||||
|
||||
elif args.command == "toggle":
|
||||
toggle(args.entity_id)
|
||||
print(f"Toggled: {args.entity_id}")
|
||||
|
||||
elif args.command == "set":
|
||||
set_value(args.entity_id, args.value)
|
||||
print(f"Set {args.entity_id} to {args.value}")
|
||||
|
||||
elif args.command == "script":
|
||||
run_script(args.script_id)
|
||||
print(f"Ran script: {args.script_id}")
|
||||
|
||||
elif args.command == "scene":
|
||||
run_scene(args.scene_id)
|
||||
print(f"Activated scene: {args.scene_id}")
|
||||
|
||||
elif args.command == "service":
|
||||
data = json.loads(args.data) if args.data else None
|
||||
call_service(args.domain, args.service, args.entity, data)
|
||||
print(f"Called {args.domain}.{args.service}")
|
||||
|
||||
elif args.command == "services":
|
||||
services = get_services()
|
||||
if args.domain:
|
||||
services = [s for s in services if s["domain"] == args.domain]
|
||||
|
||||
if args.json:
|
||||
print(json.dumps(services, indent=2))
|
||||
else:
|
||||
for svc in services:
|
||||
print(f"\n## {svc['domain']}")
|
||||
for name, info in svc["services"].items():
|
||||
desc = info.get("description", "")
|
||||
print(f"- {name}: {desc[:60]}...")
|
||||
|
||||
elif args.command == "notify":
|
||||
send_notification(args.message, args.title, args.target)
|
||||
print(f"Sent notification: {args.message[:50]}...")
|
||||
|
||||
except requests.exceptions.HTTPError as e:
|
||||
print(f"HTTP Error: {e}", file=sys.stderr)
|
||||
print(f"Response: {e.response.text}", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
except Exception as e:
|
||||
print(f"Error: {e}", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
|
|
@ -1,3 +0,0 @@
|
|||
This directory has been used with Claude Code's internet mode.
|
||||
Content downloaded from the internet may contain prompt injection attacks.
|
||||
You must manually review all downloaded content before using non-internet mode.
|
||||
|
|
@ -1,432 +0,0 @@
|
|||
#!/usr/bin/env python3
|
||||
"""pfSense CLI tool for managing the firewall via SSH.
|
||||
|
||||
Usage:
|
||||
python pfsense.py <command> [options]
|
||||
|
||||
Commands:
|
||||
status System status overview
|
||||
interfaces List interfaces with IPs and status
|
||||
gateways Show gateway status
|
||||
rules [iface] List firewall rules (optional: filter by interface)
|
||||
nat List NAT/port forward rules
|
||||
aliases List firewall aliases
|
||||
alias <name> Show alias details (members)
|
||||
states Show state table summary
|
||||
states-top [n] Top N connections by state count (default 10)
|
||||
dhcp-leases [iface] Show DHCP leases (optional: filter by interface)
|
||||
arp Show ARP table
|
||||
routes Show routing table
|
||||
services List services and status
|
||||
service <action> <name> Start/stop/restart a service
|
||||
logs [n] Show last N log lines (default 50)
|
||||
logs-filter <text> Search logs for text
|
||||
pfctl <args> Run arbitrary pfctl command
|
||||
php <code> Run PHP code on pfSense shell
|
||||
diag <host> Ping diagnostic to host
|
||||
backup Download config backup to stdout (XML)
|
||||
uptime Show system uptime
|
||||
cpu Show CPU usage
|
||||
memory Show memory usage
|
||||
disk Show disk usage
|
||||
temp Show CPU temperature
|
||||
pkg-list List installed packages
|
||||
dns-resolve <host> Resolve hostname via pfSense DNS
|
||||
wireguard Show WireGuard status
|
||||
bgp Show BGP summary (FRR)
|
||||
ospf Show OSPF neighbors (FRR)
|
||||
tailscale Show Tailscale status
|
||||
snort Show Snort status
|
||||
raw <command> Run arbitrary shell command
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import subprocess
|
||||
import sys
|
||||
|
||||
|
||||
PFSENSE_HOST = "admin@10.0.20.1"
|
||||
SSH_OPTS = ["-o", "ConnectTimeout=10", "-o", "StrictHostKeyChecking=no"]
|
||||
|
||||
|
||||
def ssh(cmd: str, timeout: int = 30) -> str:
|
||||
"""Execute a command on pfSense via SSH."""
|
||||
result = subprocess.run(
|
||||
["ssh"] + SSH_OPTS + [PFSENSE_HOST, cmd],
|
||||
capture_output=True,
|
||||
text=True,
|
||||
timeout=timeout,
|
||||
)
|
||||
if result.returncode != 0 and result.stderr:
|
||||
print(f"Error: {result.stderr.strip()}", file=sys.stderr)
|
||||
return result.stdout.strip()
|
||||
|
||||
|
||||
def cmd_status(_args):
|
||||
print(ssh("""
|
||||
echo "=== System ==="
|
||||
uname -sr
|
||||
echo "Version: $(cat /etc/version)"
|
||||
uptime
|
||||
echo ""
|
||||
echo "=== CPU ==="
|
||||
sysctl -n hw.model
|
||||
echo "Load: $(sysctl -n vm.loadavg)"
|
||||
echo ""
|
||||
echo "=== Memory ==="
|
||||
php -r '
|
||||
$mem = @file_get_contents("/proc/meminfo") ?: "";
|
||||
$total = (int)shell_exec("sysctl -n hw.physmem") / 1024 / 1024;
|
||||
$free_pages = (int)shell_exec("sysctl -n vm.stats.vm.v_free_count");
|
||||
$page_size = (int)shell_exec("sysctl -n hw.pagesize");
|
||||
$free = $free_pages * $page_size / 1024 / 1024;
|
||||
printf("Total: %.0f MB, Free: %.0f MB, Used: %.0f MB (%.1f%%)\n",
|
||||
$total, $free, $total - $free, ($total - $free) / $total * 100);
|
||||
'
|
||||
echo ""
|
||||
echo "=== Disk ==="
|
||||
df -h / /var /tmp 2>/dev/null | grep -v "^Filesystem" | awk '{print $6 ": " $3 "/" $1 " (" $5 " used)"}'
|
||||
echo ""
|
||||
echo "=== States ==="
|
||||
pfctl -si 2>/dev/null | grep "current entries"
|
||||
echo ""
|
||||
echo "=== Temperature ==="
|
||||
sysctl -a 2>/dev/null | grep temperature | head -5
|
||||
"""))
|
||||
|
||||
|
||||
def cmd_interfaces(_args):
|
||||
print(ssh("""
|
||||
php -r '
|
||||
require_once("config.inc");
|
||||
require_once("interfaces.inc");
|
||||
$cfg = parse_config(true);
|
||||
foreach($cfg["interfaces"] as $k => $v) {
|
||||
$if = $v["if"] ?? "?";
|
||||
$descr = $v["descr"] ?? $k;
|
||||
$ip = $v["ipaddr"] ?? "dhcp";
|
||||
$subnet = $v["subnet"] ?? "";
|
||||
$enabled = isset($v["enable"]) || $k == "wan" || $k == "lan" ? "UP" : "DOWN";
|
||||
$gw = $v["gateway"] ?? "-";
|
||||
printf("%-8s %-20s %-10s %-18s gw:%-10s %s\n", $k, $descr, $if, $ip . ($subnet ? "/" . $subnet : ""), $gw, $enabled);
|
||||
}
|
||||
'
|
||||
"""))
|
||||
|
||||
|
||||
def cmd_gateways(_args):
|
||||
print(ssh("pfSsh.php playback gatewaystatus"))
|
||||
|
||||
|
||||
def cmd_rules(args):
|
||||
iface_filter = args.interface if hasattr(args, 'interface') and args.interface else ""
|
||||
if iface_filter:
|
||||
print(ssh(f"pfctl -sr 2>/dev/null | grep -i '{iface_filter}'"))
|
||||
else:
|
||||
print(ssh("pfctl -sr 2>/dev/null"))
|
||||
|
||||
|
||||
def cmd_nat(_args):
|
||||
print(ssh("pfctl -sn 2>/dev/null"))
|
||||
|
||||
|
||||
def cmd_aliases(_args):
|
||||
print(ssh("pfctl -sT 2>/dev/null"))
|
||||
|
||||
|
||||
def cmd_alias(args):
|
||||
print(ssh(f"pfctl -t {args.name} -T show 2>/dev/null"))
|
||||
|
||||
|
||||
def cmd_states(_args):
|
||||
print(ssh("pfctl -si 2>/dev/null"))
|
||||
|
||||
|
||||
def cmd_states_top(args):
|
||||
n = args.n if hasattr(args, 'n') and args.n else 10
|
||||
print(ssh(f"pfctl -ss 2>/dev/null | awk '{{print $3}}' | cut -d: -f1 | sort | uniq -c | sort -rn | head -{n}"))
|
||||
|
||||
|
||||
def cmd_dhcp_leases(args):
|
||||
iface = args.interface if hasattr(args, 'interface') and args.interface else ""
|
||||
filter_clause = f'if($l["if"] == "{iface}")' if iface else ""
|
||||
print(ssh(f"""
|
||||
php -r '
|
||||
require_once("config.inc");
|
||||
require_once("interfaces.inc");
|
||||
$leases = system_get_dhcpleases();
|
||||
foreach($leases["lease"] as $l) {{
|
||||
{filter_clause}
|
||||
printf("%-16s %-18s %-8s %-15s %-10s %s\n",
|
||||
$l["ip"], $l["mac"] ?? "-", $l["act"] ?? "-",
|
||||
$l["hostname"] ?? "-", $l["if"] ?? "-",
|
||||
$l["online"] ?? "-");
|
||||
}}
|
||||
'
|
||||
"""))
|
||||
|
||||
|
||||
def cmd_arp(_args):
|
||||
print(ssh("arp -an"))
|
||||
|
||||
|
||||
def cmd_routes(_args):
|
||||
print(ssh("netstat -rn"))
|
||||
|
||||
|
||||
def cmd_services(_args):
|
||||
print(ssh("""
|
||||
php -r '
|
||||
require_once("config.inc");
|
||||
require_once("service-utils.inc");
|
||||
$svcs = get_services();
|
||||
foreach($svcs as $s) {
|
||||
$status = get_service_status($s) ? "RUNNING" : "STOPPED";
|
||||
printf("%-30s %s\n", $s["name"], $status);
|
||||
}
|
||||
'
|
||||
"""))
|
||||
|
||||
|
||||
def cmd_service(args):
|
||||
action = args.action
|
||||
name = args.name
|
||||
if action not in ("start", "stop", "restart"):
|
||||
print(f"Invalid action: {action}. Use start/stop/restart.", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
print(ssh(f"pfSsh.php playback svc {action} {name}"))
|
||||
|
||||
|
||||
def cmd_logs(args):
|
||||
n = args.n if hasattr(args, 'n') and args.n else 50
|
||||
print(ssh(f"clog -f /var/log/filter.log 2>/dev/null | tail -{n}"))
|
||||
|
||||
|
||||
def cmd_logs_filter(args):
|
||||
print(ssh(f"clog -f /var/log/filter.log 2>/dev/null | grep -i '{args.text}'"))
|
||||
|
||||
|
||||
def cmd_pfctl(args):
|
||||
print(ssh(f"pfctl {args.args}"))
|
||||
|
||||
|
||||
def cmd_php(args):
|
||||
print(ssh(f"php -r '{args.code}'"))
|
||||
|
||||
|
||||
def cmd_diag(args):
|
||||
print(ssh(f"ping -c 4 {args.host}"))
|
||||
|
||||
|
||||
def cmd_backup(_args):
|
||||
print(ssh("cat /cf/conf/config.xml"))
|
||||
|
||||
|
||||
def cmd_uptime(_args):
|
||||
print(ssh("uptime"))
|
||||
|
||||
|
||||
def cmd_cpu(_args):
|
||||
print(ssh("""
|
||||
echo "Load: $(sysctl -n vm.loadavg)"
|
||||
echo "Model: $(sysctl -n hw.model)"
|
||||
echo "Cores: $(sysctl -n hw.ncpu)"
|
||||
top -b -d1 2>/dev/null | head -5 || vmstat 1 2 | tail -1
|
||||
"""))
|
||||
|
||||
|
||||
def cmd_memory(_args):
|
||||
print(ssh("""
|
||||
php -r '
|
||||
$total = (int)shell_exec("sysctl -n hw.physmem") / 1024 / 1024;
|
||||
$free_pages = (int)shell_exec("sysctl -n vm.stats.vm.v_free_count");
|
||||
$inactive_pages = (int)shell_exec("sysctl -n vm.stats.vm.v_inactive_count");
|
||||
$cache_pages = (int)shell_exec("sysctl -n vm.stats.vm.v_cache_count");
|
||||
$page_size = (int)shell_exec("sysctl -n hw.pagesize");
|
||||
$free = $free_pages * $page_size / 1024 / 1024;
|
||||
$inactive = $inactive_pages * $page_size / 1024 / 1024;
|
||||
$cache = $cache_pages * $page_size / 1024 / 1024;
|
||||
$used = $total - $free - $inactive - $cache;
|
||||
printf("Total: %.0f MB\n", $total);
|
||||
printf("Used: %.0f MB (%.1f%%)\n", $used, $used / $total * 100);
|
||||
printf("Free: %.0f MB\n", $free);
|
||||
printf("Inactive: %.0f MB\n", $inactive);
|
||||
printf("Cache: %.0f MB\n", $cache);
|
||||
'
|
||||
"""))
|
||||
|
||||
|
||||
def cmd_disk(_args):
|
||||
print(ssh("df -h"))
|
||||
|
||||
|
||||
def cmd_temp(_args):
|
||||
print(ssh("sysctl -a 2>/dev/null | grep -i temp"))
|
||||
|
||||
|
||||
def cmd_pkg_list(_args):
|
||||
print(ssh("pfSsh.php playback listpkg"))
|
||||
|
||||
|
||||
def cmd_dns_resolve(args):
|
||||
print(ssh(f"drill {args.host} @127.0.0.1 2>/dev/null || host {args.host} 127.0.0.1 2>/dev/null || nslookup {args.host} 127.0.0.1"))
|
||||
|
||||
|
||||
def cmd_wireguard(_args):
|
||||
print(ssh("wg show 2>/dev/null || echo 'WireGuard not active or wg command not found'"))
|
||||
|
||||
|
||||
def cmd_bgp(_args):
|
||||
print(ssh("/usr/local/bin/vtysh -c 'show bgp summary' 2>/dev/null || echo 'FRR/BGP not available'"))
|
||||
|
||||
|
||||
def cmd_ospf(_args):
|
||||
print(ssh("/usr/local/bin/vtysh -c 'show ip ospf neighbor' 2>/dev/null || echo 'FRR/OSPF not available'"))
|
||||
|
||||
|
||||
def cmd_tailscale(_args):
|
||||
print(ssh("tailscale status 2>/dev/null || echo 'Tailscale not available'"))
|
||||
|
||||
|
||||
def cmd_snort(_args):
|
||||
print(ssh("""
|
||||
php -r '
|
||||
require_once("config.inc");
|
||||
require_once("service-utils.inc");
|
||||
$svcs = get_services();
|
||||
foreach($svcs as $s) {
|
||||
if(stripos($s["name"], "snort") !== false) {
|
||||
$status = get_service_status($s) ? "RUNNING" : "STOPPED";
|
||||
printf("%-30s %s\n", $s["name"], $status);
|
||||
}
|
||||
}
|
||||
'
|
||||
echo "---Alerts (last 20)---"
|
||||
cat /var/log/snort/snort_*/alert 2>/dev/null | tail -20 || echo "No alert logs found"
|
||||
"""))
|
||||
|
||||
|
||||
def cmd_raw(args):
|
||||
print(ssh(args.command))
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description="pfSense management via SSH")
|
||||
sub = parser.add_subparsers(dest="command", help="Command to run")
|
||||
|
||||
sub.add_parser("status", help="System status overview")
|
||||
sub.add_parser("interfaces", help="List interfaces")
|
||||
sub.add_parser("gateways", help="Show gateway status")
|
||||
|
||||
p = sub.add_parser("rules", help="List firewall rules")
|
||||
p.add_argument("interface", nargs="?", default="", help="Filter by interface")
|
||||
|
||||
sub.add_parser("nat", help="List NAT rules")
|
||||
sub.add_parser("aliases", help="List aliases")
|
||||
|
||||
p = sub.add_parser("alias", help="Show alias members")
|
||||
p.add_argument("name", help="Alias name")
|
||||
|
||||
sub.add_parser("states", help="State table summary")
|
||||
|
||||
p = sub.add_parser("states-top", help="Top connections by state count")
|
||||
p.add_argument("n", nargs="?", type=int, default=10)
|
||||
|
||||
p = sub.add_parser("dhcp-leases", help="Show DHCP leases")
|
||||
p.add_argument("interface", nargs="?", default="", help="Filter by interface")
|
||||
|
||||
sub.add_parser("arp", help="ARP table")
|
||||
sub.add_parser("routes", help="Routing table")
|
||||
sub.add_parser("services", help="List services")
|
||||
|
||||
p = sub.add_parser("service", help="Control a service")
|
||||
p.add_argument("action", choices=["start", "stop", "restart"])
|
||||
p.add_argument("name", help="Service name")
|
||||
|
||||
p = sub.add_parser("logs", help="Show firewall logs")
|
||||
p.add_argument("n", nargs="?", type=int, default=50)
|
||||
|
||||
p = sub.add_parser("logs-filter", help="Search logs")
|
||||
p.add_argument("text", help="Text to search for")
|
||||
|
||||
p = sub.add_parser("pfctl", help="Run pfctl command")
|
||||
p.add_argument("args", help="pfctl arguments")
|
||||
|
||||
p = sub.add_parser("php", help="Run PHP code")
|
||||
p.add_argument("code", help="PHP code to execute")
|
||||
|
||||
p = sub.add_parser("diag", help="Ping diagnostic")
|
||||
p.add_argument("host", help="Host to ping")
|
||||
|
||||
sub.add_parser("backup", help="Download config backup (XML)")
|
||||
sub.add_parser("uptime", help="System uptime")
|
||||
sub.add_parser("cpu", help="CPU usage")
|
||||
sub.add_parser("memory", help="Memory usage")
|
||||
sub.add_parser("disk", help="Disk usage")
|
||||
sub.add_parser("temp", help="CPU temperature")
|
||||
sub.add_parser("pkg-list", help="List packages")
|
||||
|
||||
p = sub.add_parser("dns-resolve", help="Resolve hostname")
|
||||
p.add_argument("host", help="Hostname to resolve")
|
||||
|
||||
sub.add_parser("wireguard", help="WireGuard status")
|
||||
sub.add_parser("bgp", help="BGP summary")
|
||||
sub.add_parser("ospf", help="OSPF neighbors")
|
||||
sub.add_parser("tailscale", help="Tailscale status")
|
||||
sub.add_parser("snort", help="Snort status")
|
||||
|
||||
p = sub.add_parser("raw", help="Run arbitrary command")
|
||||
p.add_argument("command", help="Command to run")
|
||||
|
||||
args = parser.parse_args()
|
||||
if not args.command:
|
||||
parser.print_help()
|
||||
sys.exit(1)
|
||||
|
||||
cmd_map = {
|
||||
"status": cmd_status,
|
||||
"interfaces": cmd_interfaces,
|
||||
"gateways": cmd_gateways,
|
||||
"rules": cmd_rules,
|
||||
"nat": cmd_nat,
|
||||
"aliases": cmd_aliases,
|
||||
"alias": cmd_alias,
|
||||
"states": cmd_states,
|
||||
"states-top": cmd_states_top,
|
||||
"dhcp-leases": cmd_dhcp_leases,
|
||||
"arp": cmd_arp,
|
||||
"routes": cmd_routes,
|
||||
"services": cmd_services,
|
||||
"service": cmd_service,
|
||||
"logs": cmd_logs,
|
||||
"logs-filter": cmd_logs_filter,
|
||||
"pfctl": cmd_pfctl,
|
||||
"php": cmd_php,
|
||||
"diag": cmd_diag,
|
||||
"backup": cmd_backup,
|
||||
"uptime": cmd_uptime,
|
||||
"cpu": cmd_cpu,
|
||||
"memory": cmd_memory,
|
||||
"disk": cmd_disk,
|
||||
"temp": cmd_temp,
|
||||
"pkg-list": cmd_pkg_list,
|
||||
"dns-resolve": cmd_dns_resolve,
|
||||
"wireguard": cmd_wireguard,
|
||||
"bgp": cmd_bgp,
|
||||
"ospf": cmd_ospf,
|
||||
"tailscale": cmd_tailscale,
|
||||
"snort": cmd_snort,
|
||||
"raw": cmd_raw,
|
||||
}
|
||||
|
||||
func = cmd_map.get(args.command)
|
||||
if func:
|
||||
func(args)
|
||||
else:
|
||||
parser.print_help()
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
|
|
@ -1,203 +0,0 @@
|
|||
# Authentik Current State
|
||||
|
||||
> Snapshot of applications, groups, users, and flows. Use `authentik` skill for management tasks.
|
||||
|
||||
## Applications (11)
|
||||
| Application | Provider Type | Auth Flow |
|
||||
|-------------|--------------|-----------|
|
||||
| Cloudflare Access | OAuth2/OIDC | explicit consent |
|
||||
| Domain wide catch all | Proxy (forward auth) | implicit consent |
|
||||
| Forgejo | OAuth2/OIDC | explicit consent |
|
||||
| Grafana | OAuth2/OIDC | implicit consent |
|
||||
| Headscale | OAuth2/OIDC | explicit consent |
|
||||
| Immich | OAuth2/OIDC | explicit consent |
|
||||
| Kubernetes | OAuth2/OIDC (public) | implicit consent |
|
||||
| Kubernetes Dashboard | OAuth2/OIDC (confidential) | implicit consent |
|
||||
| linkwarden | OAuth2/OIDC | explicit consent |
|
||||
| wrongmove | OAuth2/OIDC | implicit consent |
|
||||
|
||||
> **Kubernetes Dashboard** (TF-managed in `stacks/k8s-dashboard/authentik.tf`):
|
||||
> confidential client `k8s-dashboard`, built for seamless dashboard SSO via
|
||||
> oauth2-proxy. **Currently IDLE** — the apiserver rejects all OIDC tokens (see
|
||||
> `docs/plans/2026-06-04-k8s-dashboard-sso-design.md` §12), so the dashboard runs
|
||||
> on forward-auth + token-paste instead and oauth2-proxy is unwired. Kept for a
|
||||
> future SSO retry once apiserver OIDC is fixed.
|
||||
>
|
||||
> **admin-services-restriction** policy (TF-managed in
|
||||
> `stacks/authentik/admin-services-restriction.tf`, adopted 2026-06-04): gates the
|
||||
> 15 admin-only hostnames to `Home Server Admins`, with a carve-out admitting the
|
||||
> `kubernetes-*` RBAC groups to `k8s.viktorbarzin.me` (dashboard login page).
|
||||
|
||||
## Groups (9)
|
||||
| Group | Parent | Superuser | Purpose |
|
||||
|-------|--------|-----------|---------|
|
||||
| Allow Login Users | -- | No | Parent group for login-permitted users |
|
||||
| authentik Admins | -- | Yes | Full admin access |
|
||||
| Headscale Users | Allow Login Users | No | VPN access |
|
||||
| Home Server Admins | Allow Login Users | No | Server admin access |
|
||||
| Wrongmove Users | Allow Login Users | No | Real-estate app access |
|
||||
| kubernetes-admins | -- | No | K8s cluster-admin RBAC |
|
||||
| kubernetes-power-users | -- | No | K8s power-user RBAC |
|
||||
| kubernetes-namespace-owners | -- | No | K8s namespace-owner RBAC |
|
||||
| Task Submitters | -- | No | Task submission access |
|
||||
|
||||
## Users (8 real)
|
||||
| Username | Name | Type | Groups |
|
||||
|----------|------|------|--------|
|
||||
| akadmin | authentik Default Admin | internal | authentik Admins, Home Server Admins, Headscale Users |
|
||||
| vbarzin@gmail.com | Viktor Barzin | internal | authentik Admins, Home Server Admins, Wrongmove Users, Headscale Users |
|
||||
| emil.barzin@gmail.com | Emil Barzin | internal | Home Server Admins, Headscale Users |
|
||||
| ancaelena98@gmail.com | Anca Milea | external | Wrongmove Users, Headscale Users |
|
||||
| vabbit81@gmail.com | GHEORGHE Milea | external | Headscale Users, kubernetes-namespace-owners, sops-vabbit81 |
|
||||
| valentinakolevabarzina@gmail.com | Valentina | internal | Headscale Users |
|
||||
| anca.r.cristian10@gmail.com | -- | internal | Wrongmove Users |
|
||||
| kadir.tugan@gmail.com | Kadir | internal | Wrongmove Users |
|
||||
|
||||
## Login Sources
|
||||
- **Google** (OAuth) -- user matching by identifier
|
||||
- **GitHub** (OAuth) -- user matching by email_link
|
||||
- **Facebook** (OAuth) -- user matching by email_link
|
||||
- All sources use `invitation-enrollment` as enrollment flow (new users require invitation)
|
||||
|
||||
## Authorization Flows
|
||||
- **Explicit consent** (`default-provider-authorization-explicit-consent`): Shows consent screen
|
||||
- **Implicit consent** (`default-provider-authorization-implicit-consent`): Auto-redirects
|
||||
|
||||
## Invitation Enrollment Flow
|
||||
Slug: `invitation-enrollment` | PK: `7d667321-2b02-4e16-8161-148078a8dac1`
|
||||
|
||||
New users can only sign up via invitation link. Admins generate single-use invite links.
|
||||
|
||||
### Stages (in order)
|
||||
| Order | Stage | Type | Purpose |
|
||||
|-------|-------|------|---------|
|
||||
| 10 | invitation-validation | Invitation | Validates `?itoken=` parameter, blocks without valid token |
|
||||
| 20 | enrollment-identification | Identification | Shows social login (Google/GitHub/Facebook) + passkey |
|
||||
| 30 | enrollment-prompt | Prompt | Collects name and email (pre-filled from social login) |
|
||||
| 40 | enrollment-user-write | User Write | Creates user in `Allow Login Users` group |
|
||||
| 50 | enrollment-login | User Login | Auto-login after signup (policy: `invitation-group-assignment` adds user to target group from invitation `fixed_data.group`) |
|
||||
|
||||
### Invitation Management
|
||||
Script: `.claude/scripts/authentik-invite.sh`
|
||||
|
||||
```bash
|
||||
# Create invitation (single-use, no expiry)
|
||||
./authentik-invite.sh create "Headscale Users"
|
||||
|
||||
# Create invitation with expiry
|
||||
./authentik-invite.sh create "Wrongmove Users" --days 7
|
||||
|
||||
# Add user to group after enrollment
|
||||
./authentik-invite.sh assign <username> "Headscale Users"
|
||||
|
||||
# List pending invitations
|
||||
./authentik-invite.sh list
|
||||
```
|
||||
|
||||
Invited users sign up via social login (Google/GitHub/Facebook) or passkey. No username/password enrollment.
|
||||
The target group (e.g. "Headscale Users") is auto-assigned on enrollment via the `invitation-group-assignment` expression policy. The `assign` command is available for manual post-enrollment group changes.
|
||||
|
||||
## Cleanup Log (2026-03-13)
|
||||
### Deleted Flows
|
||||
- `enrollment-inviation` (typo) -- previous invitation attempt
|
||||
- `headscale-authentication` -- not used by any provider
|
||||
- `headscale-authorization` -- not used by any provider
|
||||
- `default-enrollment-flow` -- password-based, unused
|
||||
- `oauth-enrollment` -- replaced by invitation-enrollment
|
||||
|
||||
### Deleted Stages
|
||||
- `enrollment-invitation`, `enrollment-invitation-write` (from old invitation flow)
|
||||
- `invitation` (unbound)
|
||||
- `default-enrollment-prompt-first`, `default-enrollment-prompt-second` (from default enrollment)
|
||||
- `default-enrollment-user-write`, `default-enrollment-email-verification`, `default-enrollment-user-login`
|
||||
|
||||
### Deleted Groups
|
||||
- `authentik Read-only` -- 0 users, unused role
|
||||
|
||||
### Deleted Policies
|
||||
- `map github username to email` -- unbound
|
||||
- `Map Google Attributes` -- unbound
|
||||
|
||||
### Deleted Roles
|
||||
- `authentik Read-only` -- no group assignment
|
||||
|
||||
## Policy Fix (2026-04-06)
|
||||
### Unbound brute-force-protection Policy
|
||||
The `brute-force-protection` ReputationPolicy (PK: `ac98cb11-31d3-46ab-8883-bf51e6b09a60`, `check_username=True`, `check_ip=True`, `threshold=-5`) was bound to 3 authentication flows, causing "Flow does not apply to current user" for all unauthenticated users (no username to evaluate → failure_result=false → flow denied).
|
||||
|
||||
Removed bindings from:
|
||||
- `default-authentication-flow` (PK: `34618cf3`) — username/password login
|
||||
- `webauthn` (PK: `0b60c2a5`) — passkey login
|
||||
- `default-source-authentication` (PK: via policybindingmodel `1a779f24`) — Google/GitHub/Facebook OAuth
|
||||
|
||||
Policy still exists with 0 bindings. If brute-force protection is needed, bind to the **password stage** (not the flow level).
|
||||
|
||||
## Session Duration (2026-05-01)
|
||||
|
||||
Pinned via Terraform in `stacks/authentik/`:
|
||||
|
||||
| Knob | Value | Surface | Effect |
|
||||
|------|-------|---------|--------|
|
||||
| `UserLoginStage.session_duration` on `default-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_login` in `authentik_provider.tf` | Authenticated users stay logged in 4 weeks across browser restarts. No sliding refresh — resets on each login. |
|
||||
| `ProxyProvider.access_token_validity` on `Provider for Domain wide catch all` | `weeks=4` | `authentik_provider_proxy.catchall.access_token_validity` in `authentik_provider.tf` | Cookie `Max-Age` on `authentik_proxy_*` and `expires` on rows in `authentik_providers_proxy_proxysession`. Bumped 2026-05-10 from `hours=168`. **Bumping requires `kubectl rollout restart deploy/ak-outpost-authentik-embedded-outpost`** — the gorilla session store binds the value once at outpost startup; the 5-min provider refresh logs `"reusing existing session store"` and skips rebuild. |
|
||||
| `AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE` (server + worker) | `hours=2` | `server.env` + `worker.env` in `modules/authentik/values.yaml` | Anonymous Django sessions (bots, healthcheckers, partial flows) are reaped within 2h instead of the 1d default. |
|
||||
|
||||
Notes:
|
||||
- There is **no** `Brand.session_duration`; `UserLoginStage` is the only correct lever for authenticated session lifetime.
|
||||
- Embedded outpost session storage: PostgreSQL table `authentik_providers_proxy_proxysession` in authentik 2025.10+ (PR #16628), but **only when `IsEmbedded()` returns true** (i.e. `Outpost.managed == "goauthentik.io/outposts/embedded"`). Our outpost record had `managed=null` until 2026-05-10, which silently kept it on the gorilla `FilesystemStore` at `/dev/shm` (TMPDIR) and re-exposed the 2026-04-18 mismatched-session-ID class on every pod restart. Fix landed 2026-05-10: see `authentik_outpost.embedded` in `authentik_provider.tf` and post-mortem `2026-04-18-authentik-outpost-shm-full.md`.
|
||||
- The proxy outpost service has a known goauthentik 2026.2.2 bug (`internal/outpost/controllers/k8s/service.py:52`): for embedded outposts the controller sets the Service selector to `app.kubernetes.io/name=authentik` (the server pods), not `authentik-outpost-proxy`. We work around it via a `kubernetes_json_patches.service` patch on the outpost record (replaces `/spec/selector` with the outpost's own labels). Without this, endpoints are empty and Traefik forward-auth fails over to the Basic Auth realm `Emergency Access`.
|
||||
- The standalone embedded-outpost deployment needs `AUTHENTIK_POSTGRESQL__{HOST,PORT,USER,PASSWORD,NAME}` env vars to reach the dbaas cluster — codified via `kubernetes_json_patches.deployment` envFrom the shared `goauthentik` Secret. The `app.kubernetes.io/component=server` pod label is also injected via JSON patch (matches the `component:server` half of the Service selector that the controller adds for embedded outposts).
|
||||
- `ProxyProvider.remember_me_offset` stays UI-managed via `ignore_changes`.
|
||||
- The Authentik provider's resource schema does **not** expose the `Outpost.managed` field. We rely on TF's "write only fields it knows about" semantic: the server-set `goauthentik.io/outposts/embedded` value is preserved across applies because Terraform never writes `managed`. Don't change the resource provider schema expectations without verifying this assumption holds.
|
||||
- The `unauthenticated_age` env var is injected via `server.env` / `worker.env` (not `authentik.sessions.unauthenticated_age`) because we set `authentik.existingSecret.secretName: goauthentik`, which makes the chart skip rendering its own `AUTHENTIK_*` Secret. The `authentik.*` value block is therefore inert in this stack — anything new under `authentik.*` must use the `*.env` arrays instead. The same applies to the existing `authentik.cache.*`, `authentik.web.*`, `authentik.worker.*` blocks (currently inert; live values come from the orphaned, helm-keep-policy `goauthentik` Secret created by chart 2025.10.3 before `existingSecret` was introduced).
|
||||
|
||||
## Upgrade Validation Checklist
|
||||
|
||||
Run after **any** of these:
|
||||
- Authentik chart version bump in `stacks/authentik/modules/authentik/main.tf` (the `version = "..."` line on `helm_release.authentik`).
|
||||
- `goauthentik/authentik` Terraform provider version bump.
|
||||
- Outpost pod recreation (kured reboot, eviction, manual `rollout restart`, scheduler move).
|
||||
|
||||
The fragile surfaces are the `kubernetes_json_patches` and the `Outpost.managed` field — both rely on assumptions that can silently break across upgrades. The checklist exercises the same path the alerts watch, so it doubles as a smoke test for the alerts.
|
||||
|
||||
```bash
|
||||
# 1. Service routes to the outpost pod (NOT the server pods).
|
||||
# Empty endpoints => auth-proxy fallback fires; expected: ONE pod IP, ports 9000/9300/9443.
|
||||
kubectl -n authentik get endpoints ak-outpost-authentik-embedded-outpost
|
||||
|
||||
# 2. Service selector still excludes the server pods. Expected: includes
|
||||
# `app.kubernetes.io/name: authentik-outpost-proxy`. If it flips to
|
||||
# `name: authentik`, the goauthentik upstream bug came back or our
|
||||
# JSON patch was unset.
|
||||
kubectl -n authentik get svc ak-outpost-authentik-embedded-outpost -o jsonpath='{.spec.selector}'
|
||||
|
||||
# 3. Outpost mode + session backend. Expected log lines on startup:
|
||||
# {"embedded":true,"event":"Outpost mode",...}
|
||||
# {"event":"using PostgreSQL session backend",...}
|
||||
# If embedded=false or `using filesystem session backend`, the postgres
|
||||
# fix is broken — likely `Outpost.managed` got cleared, or the upstream
|
||||
# schema started exposing `managed` and TF reset it.
|
||||
kubectl -n authentik logs deploy/ak-outpost-authentik-embedded-outpost | grep -E '"Outpost mode"|"session backend"' | head -3
|
||||
|
||||
# 4. /dev/shm is essentially empty (postgres backend = no filesystem use).
|
||||
# A row count > a few dozen indicates filesystem fallback is firing.
|
||||
kubectl -n authentik exec deploy/ak-outpost-authentik-embedded-outpost -- sh -c 'df -h /dev/shm; ls /dev/shm | wc -l'
|
||||
|
||||
# 5. Postgres session table is growing with traffic. Expected: rows with
|
||||
# `expires` ~28 days out (matches access_token_validity = weeks=4).
|
||||
kubectl -n authentik exec deploy/goauthentik-server -- ak shell -c "
|
||||
from django.db import connection; c = connection.cursor()
|
||||
c.execute('SELECT COUNT(*), MAX(expires) FROM authentik_providers_proxy_proxysession')
|
||||
print(c.fetchone())"
|
||||
|
||||
# 6. Edge auth flow: should be 302 → authentik. NOT 401 with WWW-Authenticate.
|
||||
curl -sS -o /dev/null -D - 'https://terminal.viktorbarzin.me/' -H 'User-Agent: Mozilla/5.0' \
|
||||
| grep -iE '^HTTP|^location|x-auth-fallback|www-authenticate'
|
||||
|
||||
# 7. Terraform plan-to-zero on the whole authentik stack.
|
||||
( cd stacks/authentik && /home/wizard/code/infra/scripts/tg plan ) | grep -E 'No changes|Plan:'
|
||||
```
|
||||
|
||||
Steps 1, 3, 6 cover the failure modes the Prometheus alerts trigger on (`AuthentikForwardAuthFallbackActive`, `AuthentikOutpostForwardAuth400Spike`). Steps 4 and 5 cover the silent-regression case (filesystem fallback) where the alerts don't fire but the system loses its postgres-backed session persistence on the next pod restart.
|
||||
|
||||
If step 2 shows the controller restored `app.kubernetes.io/name=authentik`, watch goauthentik/authentik issue tracker for fixes around `internal/outpost/controllers/k8s/service.py:52` — the upstream patch might let us drop our `kubernetes_json_patches.service` workaround.
|
||||
|
|
@ -1,31 +0,0 @@
|
|||
# GitHub API Reference
|
||||
|
||||
> Token locations and common API patterns.
|
||||
|
||||
## GitHub API
|
||||
- **Username**: `ViktorBarzin`
|
||||
- **Token**: `grep github_pat terraform.tfvars | cut -d'"' -f2` (git-crypt encrypted)
|
||||
- **Scopes**: Full access (repo, admin:public_key, admin:repo_hook, delete_repo, admin:org, workflow, write:packages)
|
||||
- **`gh` CLI**: Blocked by sandbox — use `curl` instead
|
||||
|
||||
```bash
|
||||
GITHUB_TOKEN=$(grep github_pat terraform.tfvars | cut -d'"' -f2)
|
||||
|
||||
# List repos
|
||||
curl -s -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/users/ViktorBarzin/repos?per_page=100"
|
||||
|
||||
# Create repo
|
||||
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/user/repos" \
|
||||
-d '{"name":"repo-name","private":true}'
|
||||
|
||||
# Add deploy key
|
||||
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/repos/ViktorBarzin/<repo>/keys" \
|
||||
-d '{"title":"key-name","key":"ssh-ed25519 ...","read_only":false}'
|
||||
|
||||
# Create webhook
|
||||
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/repos/ViktorBarzin/<repo>/hooks" \
|
||||
-d '{"config":{"url":"https://ci.viktorbarzin.me/hook","content_type":"json","secret":"..."},"events":["push","pull_request"]}'
|
||||
```
|
||||
|
||||
## Capabilities
|
||||
- **GitHub**: Create/delete repos, push code, manage SSH/deploy keys, manage webhooks, manage org settings, manage packages
|
||||
|
|
@ -1,12 +0,0 @@
|
|||
# Known Issues (suppress in all agents)
|
||||
|
||||
## Permanent
|
||||
- ha-london Uptime Kuma monitor down — external HA on Raspberry Pi, not in this cluster
|
||||
- PVFillingUp for navidrome-music — Synology NAS volume, threshold is 95%, expected
|
||||
|
||||
## Intermittent
|
||||
- CrowdSec Helm release stuck in pending-upgrade — known issue, workaround: helm rollback
|
||||
- Resource usage >80% on nodes — WARN only, overcommit is by design (2x LimitRange ratio)
|
||||
|
||||
## How agents consume this file
|
||||
Each agent definition includes: "Before reporting issues, read `.claude/reference/known-issues.md` and suppress any matches."
|
||||
|
|
@ -1,115 +0,0 @@
|
|||
# Detailed Infrastructure Patterns
|
||||
|
||||
Reference file for patterns, procedures, and tables. Read on demand when the specific topic comes up.
|
||||
|
||||
## NFS Volume Pattern
|
||||
Use the `nfs_volume` shared module for all NFS volumes (creates static PVs, CSI-backed, `soft,timeo=30,retrans=3`):
|
||||
```hcl
|
||||
module "nfs_data" {
|
||||
source = "../../modules/kubernetes/nfs_volume" # ../../../../ for platform modules, ../../../ for sub-stacks
|
||||
name = "<service>-data" # Must be globally unique (PV is cluster-scoped)
|
||||
namespace = kubernetes_namespace.<service>.metadata[0].name
|
||||
nfs_server = var.nfs_server # 192.168.1.127 (Proxmox host)
|
||||
nfs_path = "/srv/nfs/<service>" # HDD NFS, or "/srv/nfs-ssd/<service>" for SSD
|
||||
}
|
||||
# In pod spec: persistent_volume_claim { claim_name = module.nfs_data.claim_name }
|
||||
```
|
||||
**Note**: Some legacy PVs still reference `/mnt/main/<service>` paths (from the TrueNAS era). These work via compatibility on the Proxmox host. New PVs should use `/srv/nfs/` or `/srv/nfs-ssd/`.
|
||||
**DO NOT use inline `nfs {}` blocks** — they mount with `hard,timeo=600` defaults which hang forever.
|
||||
|
||||
## Adding NFS Exports
|
||||
1. Create dir on Proxmox host: `ssh root@192.168.1.127 "mkdir -p /srv/nfs/<service> && chmod 777 /srv/nfs/<service>"`
|
||||
2. Edit `/etc/exports` on the Proxmox host — add the export entry
|
||||
3. Reload exports: `ssh root@192.168.1.127 "exportfs -ra"`
|
||||
4. Verify: `showmount -e 192.168.1.127`
|
||||
|
||||
## Static Site Hosting
|
||||
Two patterns for serving a folder of static files (HTML/CSS/JS/media):
|
||||
|
||||
1. **Image-baked** (default for git-native content): bake files into an `nginx:*-alpine` image at build time, deploy like any owned app (CI builds + pushes, Keel/Woodpecker rolls out). Reference: `stacks/blog` (Hugo → nginx, `Website/Dockerfile`). Use when content lives in git and changes via commits.
|
||||
|
||||
2. **NFS-backed** (for externally-authored / large / non-git content): a stock `nginx:1.28-alpine` Deployment mounts an `nfs_volume` PVC **read-only** at `/usr/share/nginx/html`; a tiny ConfigMap supplies `/etc/nginx/conf.d/default.conf` (just `root` + `index <entry>.html`). Files are dropped on `/srv/nfs/<site>` out-of-band (Nextcloud "PVE NFS Pool" or rsync) — no rebuild, auto-backed-up by `nfs-mirror`. Reference: `stacks/stem95su` (established 2026-06-07). Use when content is authored outside git (e.g. exported tools), is large (avoids git/image bloat), or a non-dev updates it. **The export subdir on the PVE host must exist before the pod mounts** — the `nfs_volume` module does NOT create it (see "Adding NFS Exports"; a subdir under the already-exported `/srv/nfs` needs no new `/etc/exports` line).
|
||||
|
||||
Both front with `ingress_factory` (`auth="none"` for open public content → CrowdSec + ai-bot-block still apply; or chain `anubis_instance` for a PoW gate, as `blog` does).
|
||||
|
||||
## ~~iSCSI Storage~~ (REMOVED — replaced by proxmox-lvm)
|
||||
> iSCSI via democratic-csi and TrueNAS has been fully removed (2026-04). All database storage now uses `StorageClass: proxmox-lvm` (Proxmox CSI, LVM-thin hotplug). TrueNAS has been decommissioned.
|
||||
|
||||
## Anti-AI Scraping (4 Active Layers) (Updated 2026-05-10)
|
||||
Default `anti_ai_scraping = true` in ingress_factory. Disable per-service: `anti_ai_scraping = false`.
|
||||
1. **Anubis PoW challenge** (per-site reverse proxy) — `modules/kubernetes/anubis_instance/`. Latest: `ghcr.io/techarohq/anubis:v1.25.0`. Difficulty 2 (~250 ms desktop / ~700 ms mobile), 30-day JWT cookie scoped to `viktorbarzin.me` so a single solve covers every Anubis-fronted subdomain. Active on: `viktorbarzin.me`, `kms.viktorbarzin.me`, `travel.viktorbarzin.me`. Add to a stack: `module "anubis" { source = "../../modules/kubernetes/anubis_instance"; name = "X"; namespace = ...; target_url = "http://<svc>.<ns>.svc.cluster.local" }`, then point ingress_factory at `module.anubis.service_name` + `port = module.anubis.service_port` and set `anti_ai_scraping = false`. Shared ed25519 signing key in Vault `secret/viktor` -> `anubis_ed25519_key`. **Avoid putting Anubis in front of CLI/API/Git endpoints (Forgejo, APIs, WebDAV)** — clients without JS can't solve PoW.
|
||||
2. **Bot blocking forwardAuth** (ForwardAuth → bot-block-proxy → poison-fountain) — global default for non-Anubis sites. `bot-block-proxy` (OpenResty in `traefik` ns) is fail-open with 100 ms connect / 200 ms read timeouts so a downed poison-fountain costs ≤200 ms per request. Source: `stacks/traefik/modules/traefik/main.tf`.
|
||||
3. **X-Robots-Tag noai** — set by `traefik-anti-ai-headers` middleware. Anubis additionally serves a comprehensive `/robots.txt` (`SERVE_ROBOTS_TXT=true`) to well-behaved bots.
|
||||
4. **Tarpit/poison content** (standalone at poison.viktorbarzin.me, `stacks/poison-fountain/`). Currently scaled to `replicas = 0` — fail-open path means no live traffic, no penalty.
|
||||
|
||||
Trap links (formerly a layer) removed April 2026 — rewrite-body plugin broken on Traefik v3.6.12 (Yaegi bugs). `strip-accept-encoding` and `anti-ai-trap-links` middlewares deleted.
|
||||
Rybbit analytics injection now via Cloudflare Worker (`stacks/rybbit/worker/`, HTMLRewriter, wildcard route `*.viktorbarzin.me/*`, 28 site ID mappings).
|
||||
Key files: `modules/kubernetes/anubis_instance/`, `stacks/poison-fountain/`, `stacks/rybbit/worker/`, `stacks/traefik/modules/traefik/main.tf`
|
||||
|
||||
## Terragrunt Architecture
|
||||
- Root `terragrunt.hcl`: DRY providers, backend, variable loading, `generate "tiers"` block
|
||||
- Each stack: `stacks/<service>/main.tf`, state at `state/stacks/<service>/terraform.tfstate`
|
||||
- Platform modules: `stacks/platform/modules/<service>/`, shared: `modules/kubernetes/`
|
||||
- Syntax: `--non-interactive`, `terragrunt run --all -- <command>` (not `run-all`)
|
||||
- Tiers auto-generated into `tiers.tf` — never add `locals { tiers = {} }` manually
|
||||
|
||||
## Factory Pattern (Multi-User Services)
|
||||
Structure: `stacks/<service>/main.tf` + `factory/main.tf`. Examples: `actualbudget`, `freedify`.
|
||||
To add a user: export NFS share, add Cloudflare route in tfvars, add module block calling factory.
|
||||
|
||||
## Node Rebuild Procedure
|
||||
1. Drain: `kubectl drain k8s-nodeX --ignore-daemonsets --delete-emptydir-data`
|
||||
2. Delete: `kubectl delete node k8s-nodeX`
|
||||
3. Destroy VM (remove from `stacks/infra/main.tf`)
|
||||
4. Get fresh join command: `ssh wizard@10.0.20.100 'sudo kubeadm token create --print-join-command'` (tokens expire 24h)
|
||||
5. Update `k8s_join_command` in `terraform.tfvars`, add VM to `stacks/infra/main.tf`, apply
|
||||
6. GPU node (k8s-node1): apply platform stack to re-apply GPU label/taint
|
||||
|
||||
## Kyverno Resource Governance
|
||||
|
||||
### LimitRange Defaults (injected when no explicit `resources {}`)
|
||||
| Tier | Default Mem | Max Mem | Default CPU | Max CPU |
|
||||
|------|------------|---------|-------------|---------|
|
||||
| 0-core | 512Mi | 8Gi | 500m | 4 |
|
||||
| 1-cluster | 512Mi | 4Gi | 500m | 2 |
|
||||
| 2-gpu | 2Gi | 16Gi | 1 | 8 |
|
||||
| 3-edge / 4-aux | 256Mi | 4Gi | 250m | 2 |
|
||||
| No tier | 256Mi | 2Gi | 250m | 1 |
|
||||
|
||||
### ResourceQuota (opt-out: `resource-governance/custom-quota=true`)
|
||||
| Tier | lim CPU | lim Mem | Pods |
|
||||
|------|---------|---------|------|
|
||||
| 0-core | 32 | 64Gi | 100 |
|
||||
| 1-cluster | 16 | 32Gi | 30 |
|
||||
| 2-gpu | 48 | 96Gi | 40 |
|
||||
| 3-edge / 4-aux | 8-16 | 16-32Gi | 20-30 |
|
||||
|
||||
Custom quotas: authentik, monitoring (opted out), nvidia (opted out), nextcloud, onlyoffice.
|
||||
LimitRange opt-out: `resource-governance/custom-limitrange=true` + custom `kubernetes_limit_range` in stack.
|
||||
|
||||
### Other Policies
|
||||
- `inject-priority-class-from-tier` (CREATE only), `inject-ndots` (ndots:2), `sync-tier-label`
|
||||
- `goldilocks-vpa-auto-mode`: VPA `off` globally — Terraform owns resources, Goldilocks observe-only
|
||||
- Security policies ALL Audit mode: `deny-privileged-containers`, `deny-host-namespaces`, `restrict-sys-admin`, `require-trusted-registries`
|
||||
|
||||
### Debugging Container Failures
|
||||
1. **OOMKilled?** → `kubectl describe limitrange tier-defaults -n <ns>`. edge/aux default = 256Mi.
|
||||
2. **Won't schedule?** → `kubectl describe resourcequota tier-quota -n <ns>`.
|
||||
3. **Evicted?** → aux-tier pods (priority 200K, Never preempt) evicted first.
|
||||
4. **Unexpected limits?** → LimitRange injects defaults. Always set explicit resources.
|
||||
5. **Need more?** → Set explicit `resources {}` or add quota/limitrange opt-out labels.
|
||||
|
||||
## Authentik (Identity Provider)
|
||||
- **URL**: `https://authentik.viktorbarzin.me` | **API**: `/api/v3/` | **Token**: `authentik_api_token` in tfvars
|
||||
- 3 server + 3 worker + 3 PgBouncer + embedded outpost
|
||||
- Forward auth: `protected = true` in ingress_factory
|
||||
- OIDC for K8s: issuer `.../application/o/kubernetes/`, client `kubernetes` (public)
|
||||
- See archived skills for management tasks and OIDC gotchas
|
||||
|
||||
## Archived Troubleshooting Runbooks
|
||||
28 skills in `.claude/skills/archived/` — load when the specific issue arises.
|
||||
Topics: authentik, bluestacks, clickhouse-nfs, coturn, crowdsec, fastapi-svelte-gpu,
|
||||
grafana-datasource, helm-stuck, ingress-migration, image-caching, gpu-devices, hpa-storm,
|
||||
nfs-mount, kubelet-manifest, llm-gpu, loki-helm, librespot, nextcloud-calendar, nfsv4-idmapd,
|
||||
openclaw-deploy, pfsense-dnsmasq, pfsense-nat, proxmox-disk, python-sanitize, terraform-state,
|
||||
traefik-helm, traefik-rewrite-body.
|
||||
|
|
@ -1,130 +0,0 @@
|
|||
# Proxmox Inventory & Infrastructure
|
||||
|
||||
> Static reference for VMs, hardware, and network topology.
|
||||
|
||||
## Proxmox Host Hardware
|
||||
- **Model**: Dell R730
|
||||
- **CPU**: Intel Xeon E5-2699 v4 @ 2.20GHz (22 cores / 44 threads, single socket, CPU2 unpopulated)
|
||||
- **RAM**: 272 GB DDR4-2400 ECC RDIMM (10 DIMMs, see Memory Layout below)
|
||||
- **GPU**: NVIDIA Tesla T4 (PCIe passthrough to k8s-node1)
|
||||
- **iDRAC**: 192.168.1.4 (root/calvin)
|
||||
- **Disks**: 1.1TB RAID1 SAS (backup) + 931GB Samsung SSD + 10.7TB RAID1 HDD
|
||||
- **NFS server**: Proxmox host serves NFS directly. HDD NFS: `/srv/nfs` on ext4 LV `pve/nfs-data` (2TB). SSD NFS: `/srv/nfs-ssd` on ext4 LV `ssd/nfs-ssd-data` (100GB). Exports use `async` mode (safe with UPS + databases on block storage). TrueNAS (10.0.10.15) decommissioned.
|
||||
- **Proxmox access**: `ssh root@192.168.1.127`
|
||||
|
||||
## Memory Layout (updated 2026-04-01)
|
||||
|
||||
### Physical DIMM Slot Map
|
||||
|
||||
```
|
||||
╔══════════════════════════════════════════════════════════════════════════════╗
|
||||
║ CPU1 DIMM SLOTS ║
|
||||
║ ║
|
||||
║ ┌─── WHITE (1st per channel) ───┐ ║
|
||||
║ │ │ ║
|
||||
║ │ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ║
|
||||
║ │ │ A1 │ │ A2 │ │ A3 │ │ A4 │ ║
|
||||
║ │ │ 32G │ │ 32G │ │ 32G │ │ 32G │ Samsung M393A4K40BB1-CRC (2R) ║
|
||||
║ │ │██████│ │██████│ │██████│ │██████│ ║
|
||||
║ │ └──────┘ └──────┘ └──────┘ └──────┘ ║
|
||||
║ │ Ch 0 Ch 1 Ch 2 Ch 3 ║
|
||||
║ └────────────────────────────────┘ ║
|
||||
║ ║
|
||||
║ ┌─── BLACK (2nd per channel) ───┐ ║
|
||||
║ │ │ ║
|
||||
║ │ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ║
|
||||
║ │ │ A5 │ │ A6 │ │ A7 │ │ A8 │ ║
|
||||
║ │ │ 32G │ │ 32G │ │ 32G │ │ 32G │ Samsung M393A4K40CB1-CRC (2R) ║
|
||||
║ │ │▓▓▓▓▓▓│ │▓▓▓▓▓▓│ │▓▓▓▓▓▓│ │▓▓▓▓▓▓│ ║
|
||||
║ │ └──────┘ └──────┘ └──────┘ └──────┘ ║
|
||||
║ │ Ch 0 Ch 1 Ch 2 Ch 3 ║
|
||||
║ └────────────────────────────────┘ ║
|
||||
║ ║
|
||||
║ ┌─── GREEN (3rd per channel) ───┐ ║
|
||||
║ │ │ ║
|
||||
║ │ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ║
|
||||
║ │ │ A9 │ │ A10 │ │ A11 │ │ A12 │ ║
|
||||
║ │ │ │ │ │ │ 8G │ │ 8G │ SK Hynix HMA81GR7AFR8N-UH (1R) ║
|
||||
║ │ │ empty│ │ empty│ │░░░░░░│ │░░░░░░│ ║
|
||||
║ │ └──────┘ └──────┘ └──────┘ └──────┘ ║
|
||||
║ │ Ch 0 Ch 1 Ch 2 Ch 3 ║
|
||||
║ └────────────────────────────────┘ ║
|
||||
║ ║
|
||||
║ B1-B12: All empty (requires CPU2) ║
|
||||
║ ║
|
||||
║ Legend: ██ = Samsung BB1 32G ▓▓ = Samsung CB1 32G ░░ = Hynix 8G ║
|
||||
╚══════════════════════════════════════════════════════════════════════════════╝
|
||||
```
|
||||
|
||||
### Channel Summary
|
||||
|
||||
```
|
||||
Channel 0: A1 [32G] ──── A5 [32G] ──── A9 [ ] = 64 GB ✓ matched
|
||||
Channel 1: A2 [32G] ──── A6 [32G] ──── A10[ ] = 64 GB ✓ matched
|
||||
Channel 2: A3 [32G] ──── A7 [32G] ──── A11[ 8G ] = 72 GB ~ +8G bonus
|
||||
Channel 3: A4 [32G] ──── A8 [32G] ──── A12[ 8G ] = 72 GB ~ +8G bonus
|
||||
───────── ───────── ──────────
|
||||
WHITE BLACK GREEN TOTAL: 272 GB
|
||||
```
|
||||
|
||||
### DIMM Details
|
||||
|
||||
- **A1-A4**: Samsung M393A4K40BB1-CRC 32GB DDR4-2400 ECC RDIMM (2-rank, original)
|
||||
- **A5-A8**: Samsung M393A4K40CB1-CRC 32GB DDR4-2400 ECC RDIMM (2-rank, added 2026-04-01)
|
||||
- **A11-A12**: SK Hynix HMA81GR7AFR8N-UH 8GB DDR4-2400 ECC RDIMM (1-rank, relocated from A5/A6)
|
||||
- **A9-A10, B1-B12**: Empty (B-side requires CPU2)
|
||||
- **Speed**: 2400 MHz (BIOS override — 3 DPC defaults to 1866 MHz, forced to 2400 via System BIOS > Memory Settings > Memory Frequency)
|
||||
|
||||
## Network Topology
|
||||
```
|
||||
10.0.10.0/24 - Management: Wizard (10.0.10.10)
|
||||
10.0.20.0/24 - Kubernetes: pfSense GW (10.0.20.1), Registry (10.0.20.10),
|
||||
k8s-master (10.0.20.100), DNS (10.0.20.101), MetalLB (10.0.20.102-200)
|
||||
192.168.1.0/24 - Physical: Proxmox (192.168.1.127)
|
||||
```
|
||||
|
||||
## Network Bridges
|
||||
- **vmbr0**: Physical bridge on `eno1`, IP `192.168.1.127/24` — physical/home network
|
||||
- **vmbr1**: Internal-only bridge, VLAN-aware — VLAN 10 (management) and VLAN 20 (kubernetes)
|
||||
|
||||
## VM Inventory
|
||||
|
||||
| VMID | Name | Status | CPUs | RAM | Network | Disk | Notes |
|
||||
|------|------|--------|------|-----|---------|------|-------|
|
||||
| 101 | pfsense | running | 8 | 4GB | vmbr0, vmbr1:vlan10, vmbr1:vlan20 | 32G | Gateway/firewall |
|
||||
| 102 | devvm | running | 16 | 24GB | vmbr1:vlan10 | 100G | Development VM + t3code Workstation host. 8G swapfile (swappiness=10). Capacity budget: ~4-5G RAM/active user, max ~3-4 concurrent active Claude sessions. NOT Terraform-managed. |
|
||||
| 103 | home-assistant | running | 8 | 8GB | vmbr0 | 64G | HA Sofia, net0(vlan10) disabled, SSH: vbarzin@192.168.1.8 |
|
||||
| 105 | pbs | stopped | 16 | 8GB | vmbr1:vlan10 | 32G | Proxmox Backup (unused) |
|
||||
| 200 | k8s-master | running | 8 | 16GB | vmbr1:vlan20 | 64G | Control plane (10.0.20.100) |
|
||||
| 201 | k8s-node1 | running | 16 | 32GB | vmbr1:vlan20 | 256G | GPU node, Tesla T4 |
|
||||
| 202 | k8s-node2 | running | 8 | 24GB | vmbr1:vlan20 | 256G | Worker |
|
||||
| 203 | k8s-node3 | running | 8 | 24GB | vmbr1:vlan20 | 256G | Worker |
|
||||
| 204 | k8s-node4 | running | 8 | 24GB | vmbr1:vlan20 | 256G | Worker |
|
||||
| 220 | docker-registry | running | 4 | 4GB | vmbr1:vlan20 | 64G | MAC DE:AD:BE:EF:22:22 (10.0.20.10) |
|
||||
| 300 | Windows10 | running | 16 | 8GB | vmbr0 | 100G | Windows VM |
|
||||
| ~~9000~~ | ~~truenas~~ | **stopped/decommissioned** | — | — | — | — | NFS migrated to Proxmox host (192.168.1.127) at `/srv/nfs` and `/srv/nfs-ssd` |
|
||||
|
||||
**Total VM RAM allocated**: 196 GB of 272 GB (72%) — 76 GB free for future VMs (devvm corrected 8GB→24GB 2026-06-08)
|
||||
|
||||
## VM Templates
|
||||
| VMID | Name | Purpose |
|
||||
|------|------|---------|
|
||||
| 1000 | ubuntu-2404-cloudinit-non-k8s-template | Base for non-K8s VMs |
|
||||
| 1001 | docker-registry-template | Docker registry VM |
|
||||
| 2000 | ubuntu-2404-cloudinit-k8s-template | Base for K8s nodes |
|
||||
|
||||
## PVE Host Systemd Services (Custom)
|
||||
|
||||
| Unit | Type | Schedule | Purpose |
|
||||
|------|------|----------|---------|
|
||||
| `lvm-pvc-snapshot.timer` | Timer | Daily 03:00 | LVM thin snapshots of all PVCs (7-day retention) |
|
||||
| `daily-backup.timer` | Timer | Daily 05:00 | PVC file backup, auto SQLite backup, pfSense, PVE config |
|
||||
| `offsite-sync-backup.timer` | Timer | Daily 06:00 | Two-step rsync to Synology (sda + NFS via inotify) |
|
||||
| `nfs-change-tracker.service` | Service | Continuous | inotifywait on `/srv/nfs` + `/srv/nfs-ssd`, logs to `/mnt/backup/.nfs-changes.log` |
|
||||
|
||||
## GPU Node (currently k8s-node1)
|
||||
- **VMID**: 201, **PCIe**: `0000:06:00.0` (NVIDIA Tesla T4) — physical passthrough, no Terraform pin
|
||||
- **Taint**: `nvidia.com/gpu=true:PreferNoSchedule` (applied dynamically to every NFD-discovered GPU node)
|
||||
- **Label**: `nvidia.com/gpu.present=true` (auto-applied by gpu-feature-discovery; also `feature.node.kubernetes.io/pci-10de.present=true` from NFD)
|
||||
- GPU workloads need: `node_selector = { "nvidia.com/gpu.present" : "true" }` + nvidia toleration
|
||||
- Taint applied via `null_resource.gpu_node_config` in `stacks/nvidia/modules/nvidia/main.tf`; node discovery keyed on the NFD `pci-10de.present` label so the taint follows the card to whichever host is carrying it
|
||||
|
|
@ -116,7 +116,7 @@
|
|||
| status-page | Status page | status-page |
|
||||
| plotting-book | Book plotting/world-building app | plotting-book |
|
||||
| tripit | Self-hosted TripIt-clone travel-itinerary PWA (FastAPI + SvelteKit SPA, same-origin). CNPG (`tripit` db, Vault static role `pg-tripit`) + RWX NFS trip-doc vault (`/srv/nfs/tripit-documents`) + RWO `proxmox-lvm-encrypted` personal-document vault `tripit-personal-documents` (passports/IDs — AES-256-GCM app-layer envelope, master key `DOCUMENT_ENCRYPTION_KEY` in `secret/tripit`). `auth=required` (Authentik forward-auth, reads `X-authentik-email`); second `auth=none` ingress on `/api/calendar` for HMAC-token-gated `.ics` feed. Email-ingest CronJob `tripit-ingest-plans` (`*/15`) is the SOLE inbound path — forward a booking to plans@viktorbarzin.me (catch-all → spam@), polled read-only and routed ONLY to a registered user / verified linked address (no default-owner fallback; strangers ignored), parsed by local LLM (`qwen3vl-4b`), and the sender is emailed the outcome (Added to trip / Couldn't import). Plus `tripit-poll-flights`, `tripit-run-reminders`, `tripit-transport-nudge`, `tripit-weather-brief`. (The old Gmail-scrape `tripit-ingest-mail` CronJob was removed 2026-06-05.) App secrets in Vault `secret/tripit`. | tripit |
|
||||
| stem95su | STEM educational platform for **95. СУ „Проф. Иван Шишманов"** (Sofia school) at stem95su.viktorbarzin.me. Public **open** static site (`auth=none` — CrowdSec + ai-bot-block, no login). Stock `nginx:1.28-alpine` serving content **straight off PVE host NFS** `/srv/nfs/stem-site` (RWX `nfs_volume`, mounted read-only) — **NOT** image-baked, so the externally-authored (Gemini-exported) HTML/media updates with no rebuild; auto-backed-up offsite by `nfs-mirror`. **Content source = Google Drive folder "claude"** (id `1cmOI2jRyBJdnrVPgbr4kx2cx_4DY6pm_`, shared Valentina→vbarzin@gmail.com). **Deploy is ON-DEMAND, no scheduled job** (deliberate — short-term content, avoid rotting artifacts): mirror Drive→NFS via a throwaway `rclone/rclone` container using the existing `google_workspace` OAuth creds in Vault `secret/viktor` (`google_workspace_mcp_token_json`) → rsync to `/srv/nfs/stem-site` (empty-source guard). Just ask Claude to "sync stem95su from Drive" (recipe in claude-memory). Nextcloud "PVE NFS Pool"/rsync still works as a manual fallback. Dashboard `stem_board.html` served at `/` via a small nginx ConfigMap (`index`). No DB, no in-cluster secrets. Reference impl for the NFS-backed static-site pattern (see patterns.md). | stem95su |
|
||||
| stem95su | STEM educational platform for **95. СУ „Проф. Иван Шишманов"** (Sofia school) at stem95su.viktorbarzin.me. Public **open** static site (`auth=none` — CrowdSec + ai-bot-block, no login). Stock `nginx:1.28-alpine` serving content **straight off PVE host NFS** `/srv/nfs/stem-site` (RWX `nfs_volume`, mounted read-only) — **NOT** image-baked, so the externally-authored (Gemini-exported) HTML/media updates with no rebuild; auto-backed-up offsite by `nfs-mirror`. **Content source = Google Drive folder "claude"** (id `1cmOI2jRyBJdnrVPgbr4kx2cx_4DY6pm_`, shared Valentina→vbarzin@gmail.com). **Deploy = scheduled mirror** (since 2026-06-09, reversed the earlier on-demand-only call once content went active): CronJob `stem95su-gdrive-sync` (`*/10`, `stacks/stem95su/gdrive-sync.tf`) mounts the content PVC RW and `rclone sync`s the Drive folder onto it (`docker.io/rclone/rclone:1.74.3`, `scope=drive.readonly` — Drive is READ-ONLY; empty-source guard + `--max-delete 25` so a partial listing can't wipe the site). rclone creds (OAuth refresh-token) in Vault `secret/stem95su` (`rclone_conf`) → ESO secret `stem95su-rclone`. **Requires the GCP OAuth app (project home-lab-1700868541205) published to "Production"** or the refresh token expires ~weekly (re-mint + `vault kv put secret/stem95su rclone_conf=…` after publishing); a dead token surfaces as a failed Job. Manual on-demand sync still possible (throwaway rclone container from devvm; recipe in claude-memory). Nextcloud "PVE NFS Pool"/rsync is a manual fallback. Dashboard `stem_board.html` served at `/` via a small nginx ConfigMap (`index`). No DB, no in-cluster secrets. Reference impl for the NFS-backed static-site pattern (see patterns.md). | stem95su |
|
||||
| trek | **TRIAL (2026-06-05)** — self-hosted group-trip planner (upstream [TREK](https://github.com/mauriceboe/TREK), `mauriceboe/trek:3.0.22`, AGPL-3.0). Solo evaluation behind Authentik forward-auth (`auth=required`) before deciding build-vs-adopt; covers collaborative trip planning + accommodation records + activities + per-person budget splitting on free OpenStreetMap (no paid maps key). SQLite + uploads on `proxmox-lvm-encrypted` (`trek-data-encrypted` 2Gi, `trek-uploads-encrypted` 5Gi). For the trial only: `ENCRYPTION_KEY` is TREK-auto-generated onto the data PVC and the bootstrap admin (`admin@trek.local`) is printed to pod logs — NO Vault/ESO wiring (graduation TODO: move key to `secret/trek` + ESO, add an app-level SQLite backup CronJob since host file-backup can't read the LUKS PVC, wire TREK↔Authentik OIDC). Pinned image, TF-managed (no CI/Keel). Availability-poll companion (Rallly) deferred. Teardown: `tg destroy` in `stacks/trek`. | trek |
|
||||
|
||||
## Cloudflare Domains
|
||||
|
|
|
|||
|
|
@ -1,164 +0,0 @@
|
|||
{
|
||||
"github_repo_overrides": {
|
||||
"ghcr.io/immich-app/immich-server": "immich-app/immich",
|
||||
"ghcr.io/immich-app/immich-machine-learning": "immich-app/immich",
|
||||
"docker.io/vaultwarden/server": "dani-garcia/vaultwarden",
|
||||
"vaultwarden/server": "dani-garcia/vaultwarden",
|
||||
"docker.io/mailserver/docker-mailserver": "docker-mailserver/docker-mailserver",
|
||||
"mailserver/docker-mailserver": "docker-mailserver/docker-mailserver",
|
||||
"docker.n8n.io/n8nio/n8n": "n8n-io/n8n",
|
||||
"headscale/headscale": "juanfont/headscale",
|
||||
"technitium/dns-server": "TechnitiumSoftware/DnsServer",
|
||||
"ghcr.io/paperless-ngx/paperless-ngx": "paperless-ngx/paperless-ngx",
|
||||
"ghcr.io/blakeblackshear/frigate": "blakeblackshear/frigate",
|
||||
"ghcr.io/dgtlmoon/changedetection.io": "dgtlmoon/changedetection.io",
|
||||
"ghcr.io/linkwarden/linkwarden": "linkwarden/linkwarden",
|
||||
"ghcr.io/open-webui/open-webui": "open-webui/open-webui",
|
||||
"ghcr.io/advplyr/audiobookshelf": "advplyr/audiobookshelf",
|
||||
"ghcr.io/browserless/chromium": "browserless/chromium",
|
||||
"ghcr.io/rybbit-io/rybbit-backend": "rybbit-io/rybbit",
|
||||
"ghcr.io/rybbit-io/rybbit-client": "rybbit-io/rybbit",
|
||||
"ghcr.io/gurucomputing/headscale-ui": "gurucomputing/headscale-ui",
|
||||
"ghcr.io/dmunozv04/isponsorblocktv": "dmunozv04/iSponsorBlockTV",
|
||||
"ghcr.io/gramps-project/grampsweb": "gramps-project/gramps-web",
|
||||
"ghcr.io/project-osrm/osrm-backend": "Project-OSRM/osrm-backend",
|
||||
"ghcr.io/flaresolverr/flaresolverr": "FlareSolverr/FlareSolverr",
|
||||
"ghcr.io/therobbiedavis/listenarr": "therobbiedavis/listenarr",
|
||||
"ghcr.io/immichframe/immichframe": "immichframe/ImmichFrame",
|
||||
"lscr.io/linuxserver/qbittorrent": "linuxserver/docker-qbittorrent",
|
||||
"lscr.io/linuxserver/lidarr": "linuxserver/docker-lidarr",
|
||||
"lscr.io/linuxserver/prowlarr": "linuxserver/docker-prowlarr",
|
||||
"lscr.io/linuxserver/readarr": "linuxserver/docker-readarr",
|
||||
"lscr.io/linuxserver/speedtest-tracker": "linuxserver/docker-speedtest-tracker",
|
||||
"privatebin/nginx-fpm-alpine": "PrivateBin/PrivateBin",
|
||||
"freshrss/freshrss": "FreshRSS/FreshRSS",
|
||||
"hackmdio/hackmd": "hackmdio/codimd",
|
||||
"onlyoffice/documentserver": "ONLYOFFICE/DocumentServer",
|
||||
"netboxcommunity/netbox": "netbox-community/netbox",
|
||||
"stirlingtools/stirling-pdf": "Stirling-Tools/Stirling-PDF",
|
||||
"phpipam/phpipam-www": "phpipam/phpipam",
|
||||
"rhasspy/wyoming-whisper": "rhasspy/wyoming-addons",
|
||||
"rhasspy/wyoming-piper": "rhasspy/wyoming-addons",
|
||||
"clickhouse/clickhouse-server": "ClickHouse/ClickHouse",
|
||||
"docker.io/athomasson2/ebook2audiobook": "athomasson2/ebook2audiobook",
|
||||
"amruthpillai/reactive-resume": "AmruthPillworking/Reactive-Resume",
|
||||
"dpage/pgadmin4": "pgadmin-org/pgadmin4",
|
||||
"ghcr.io/yourok/torrserver": "YouROK/TorrServer",
|
||||
"opentripplanner/opentripplanner": "opentripplanner/OpenTripPlanner",
|
||||
"codeberg.org/forgejo/forgejo": "forgejo/forgejo",
|
||||
"shlinkio/shlink": "shlinkio/shlink",
|
||||
"shlinkio/shlink-web-client": "shlinkio/shlink-web-client",
|
||||
"dgtlmoon/sockpuppetbrowser": "dgtlmoon/sockpuppetbrowser"
|
||||
},
|
||||
"helm_chart_repo_overrides": {
|
||||
"https://charts.goauthentik.io/": "goauthentik/authentik",
|
||||
"https://traefik.github.io/charts": "traefik/traefik-helm-chart",
|
||||
"https://kyverno.github.io/kyverno/": "kyverno/kyverno",
|
||||
"https://mysql.github.io/mysql-operator/": "mysql/mysql-operator",
|
||||
"https://cloudnative-pg.github.io/charts": "cloudnative-pg/cloudnative-pg",
|
||||
"https://charts.external-secrets.io": "external-secrets/external-secrets",
|
||||
"https://metallb.github.io/metallb": "metallb/metallb",
|
||||
"https://nextcloud.github.io/helm/": "nextcloud/helm",
|
||||
"https://crowdsecurity.github.io/helm-charts": "crowdsecurity/helm-charts",
|
||||
"https://helm.releases.hashicorp.com": "hashicorp/vault-helm",
|
||||
"https://bitnami-labs.github.io/sealed-secrets": "bitnami-labs/sealed-secrets",
|
||||
"https://grafana.github.io/helm-charts": "grafana/helm-charts",
|
||||
"https://prometheus-community.github.io/helm-charts": "prometheus-community/helm-charts",
|
||||
"https://democratic-csi.github.io/charts/": "democratic-csi/democratic-csi",
|
||||
"https://stakater.github.io/stakater-charts": "stakater/Reloader",
|
||||
"https://topolvm.github.io/pvc-autoresizer": "topolvm/pvc-autoresizer",
|
||||
"https://kubernetes-sigs.github.io/descheduler/": "kubernetes-sigs/descheduler",
|
||||
"https://kubernetes-sigs.github.io/metrics-server/": "kubernetes-sigs/metrics-server",
|
||||
"https://charts.fairwinds.com/stable": "FairwindsOps/goldilocks",
|
||||
"https://helm.ngc.nvidia.com/nvidia": "NVIDIA/gpu-operator",
|
||||
"oci://ghcr.io/woodpecker-ci/helm": "woodpecker-ci/helm",
|
||||
"oci://10.0.20.10:5000/bitnamicharts": "bitnami/charts"
|
||||
},
|
||||
"db_backed_services": {
|
||||
"affine": { "type": "postgresql", "db_name": "affine", "shared": true },
|
||||
"claude-memory": { "type": "postgresql", "db_name": "claude_memory", "shared": true },
|
||||
"crowdsec": { "type": "postgresql", "db_name": "crowdsec", "shared": true },
|
||||
"dawarich": { "type": "postgresql", "db_name": "dawarich", "shared": true },
|
||||
"health": { "type": "postgresql", "db_name": "health", "shared": true },
|
||||
"linkwarden": { "type": "postgresql", "db_name": "linkwarden", "shared": true },
|
||||
"n8n": { "type": "postgresql", "db_name": "n8n", "shared": true },
|
||||
"netbox": { "type": "postgresql", "db_name": "netbox", "shared": true },
|
||||
"rybbit": { "type": "postgresql", "db_name": "rybbit", "shared": true },
|
||||
"tandoor": { "type": "postgresql", "db_name": "tandoor", "shared": true },
|
||||
"technitium": { "type": "postgresql", "db_name": "technitium", "shared": true },
|
||||
"trading-bot": { "type": "postgresql", "db_name": "trading_bot", "shared": true },
|
||||
"woodpecker": { "type": "postgresql", "db_name": "woodpecker", "shared": true },
|
||||
"immich": { "type": "postgresql", "db_name": "immich", "dedicated": true, "backup_cronjob": "postgresql-backup", "backup_namespace": "immich" },
|
||||
"authentik": { "type": "postgresql", "dedicated": true, "notes": "Uses PgBouncer, managed by Helm chart" },
|
||||
"hackmd": { "type": "mysql", "db_name": "codimd", "shared": true },
|
||||
"mailserver": { "type": "mysql", "db_name": "mailserver", "shared": true },
|
||||
"monitoring": { "type": "mysql", "db_name": "monitoring", "shared": true, "notes": "Grafana backend" },
|
||||
"nextcloud": { "type": "mysql", "db_name": "nextcloud", "shared": true },
|
||||
"onlyoffice": { "type": "mysql", "db_name": "onlyoffice", "shared": true },
|
||||
"paperless-ngx": { "type": "mysql", "db_name": "paperless_ngx", "shared": true },
|
||||
"phpipam": { "type": "mysql", "db_name": "phpipam", "shared": true },
|
||||
"real-estate-crawler": { "type": "mysql", "db_name": "wrongmove", "shared": true },
|
||||
"speedtest": { "type": "mysql", "db_name": "speedtest", "shared": true },
|
||||
"url": { "type": "mysql", "db_name": "shlink", "shared": true },
|
||||
"vault": { "type": "mysql", "db_name": "vault", "shared": true }
|
||||
},
|
||||
"backup_infrastructure": {
|
||||
"postgresql": {
|
||||
"cronjob_name": "postgresql-backup",
|
||||
"namespace": "dbaas",
|
||||
"credential_secret": "pg-cluster-superuser",
|
||||
"credential_key": "password",
|
||||
"host": "pg-cluster-rw.dbaas",
|
||||
"backup_pvc": "dbaas-postgresql-backup-host"
|
||||
},
|
||||
"mysql": {
|
||||
"cronjob_name": "mysql-backup",
|
||||
"namespace": "dbaas",
|
||||
"credential_secret": "cluster-secret",
|
||||
"credential_key": "ROOT_PASSWORD",
|
||||
"host": "mysql.dbaas",
|
||||
"backup_pvc": "dbaas-mysql-backup-host"
|
||||
}
|
||||
},
|
||||
"version_jump_always_step": [
|
||||
"authentik",
|
||||
"nextcloud",
|
||||
"immich"
|
||||
],
|
||||
"auto_detect_rules": {
|
||||
"ghcr.io/{org}/{repo}": "Use org/repo directly, strip -server/-backend suffixes if repo 404s",
|
||||
"docker.io/{org}/{repo}": "Try org/repo on GitHub",
|
||||
"lscr.io/linuxserver/{app}": "Map to linuxserver/docker-{app}",
|
||||
"quay.io/{org}/{repo}": "Try org/repo on GitHub",
|
||||
"registry.gitlab.com/{org}/{repo}": "Try org/repo on GitHub (may be GitLab-only)"
|
||||
},
|
||||
"skip_image_patterns": [
|
||||
"viktorbarzin/*",
|
||||
"registry.viktorbarzin.me/*",
|
||||
"ancamilea/*",
|
||||
"mghee/*",
|
||||
"*postgres*",
|
||||
"*mysql*",
|
||||
"*redis*",
|
||||
"*clickhouse*",
|
||||
"*etcd*",
|
||||
"registry.k8s.io/*",
|
||||
"quay.io/tigera/*",
|
||||
"quay.io/metallb/*",
|
||||
"nvcr.io/*",
|
||||
"reg.kyverno.io/*"
|
||||
],
|
||||
"breaking_change_keywords": [
|
||||
"breaking",
|
||||
"BREAKING",
|
||||
"migration required",
|
||||
"schema change",
|
||||
"database migration",
|
||||
"manual intervention",
|
||||
"action required",
|
||||
"removed",
|
||||
"deprecated",
|
||||
"renamed",
|
||||
"incompatible"
|
||||
]
|
||||
}
|
||||
|
|
@ -1,134 +0,0 @@
|
|||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
KUBECTL="kubectl --kubeconfig /Users/viktorbarzin/code/infra/config"
|
||||
AGENT="authentik-audit"
|
||||
DRY_RUN=false
|
||||
NAMESPACE="authentik"
|
||||
|
||||
for arg in "$@"; do
|
||||
case "$arg" in
|
||||
--dry-run) DRY_RUN=true ;;
|
||||
esac
|
||||
done
|
||||
|
||||
checks=()
|
||||
|
||||
add_check() {
|
||||
local name="$1" status="$2" message="$3"
|
||||
checks+=("{\"name\": \"$name\", \"status\": \"$status\", \"message\": \"$message\"}")
|
||||
}
|
||||
|
||||
find_authentik_pod() {
|
||||
local pod
|
||||
pod=$($KUBECTL get pods -n "$NAMESPACE" -l app.kubernetes.io/name=authentik,app.kubernetes.io/component=server -o jsonpath='{.items[0].metadata.name}' 2>/dev/null) || \
|
||||
pod=$($KUBECTL get pods -n "$NAMESPACE" --no-headers 2>/dev/null | grep -i "goauthentik-server\|authentik-server" | grep "Running" | head -1 | awk '{print $1}') || true
|
||||
echo "$pod"
|
||||
}
|
||||
|
||||
check_server_health() {
|
||||
if $DRY_RUN; then
|
||||
add_check "authentik-server" "ok" "dry-run: would check goauthentik-server pod health"
|
||||
return
|
||||
fi
|
||||
|
||||
local pods
|
||||
pods=$($KUBECTL get pods -n "$NAMESPACE" --no-headers 2>/dev/null | grep -i "authentik") || {
|
||||
add_check "authentik-server" "fail" "No Authentik pods found in namespace ${NAMESPACE}"
|
||||
return
|
||||
}
|
||||
|
||||
local not_running
|
||||
not_running=$(echo "$pods" | grep -v "Running" | grep -v "Completed" | grep -c "." 2>/dev/null || echo "0")
|
||||
|
||||
local total
|
||||
total=$(echo "$pods" | grep -c "." 2>/dev/null || echo "0")
|
||||
|
||||
if [ "$not_running" -gt 0 ]; then
|
||||
add_check "authentik-server" "warn" "${not_running}/${total} Authentik pod(s) not running"
|
||||
else
|
||||
add_check "authentik-server" "ok" "All ${total} Authentik pod(s) running"
|
||||
fi
|
||||
}
|
||||
|
||||
check_outposts() {
|
||||
if $DRY_RUN; then
|
||||
add_check "authentik-outposts" "ok" "dry-run: would check Authentik outpost pods"
|
||||
return
|
||||
fi
|
||||
|
||||
local outpost_pods
|
||||
outpost_pods=$($KUBECTL get pods -n "$NAMESPACE" -l app.kubernetes.io/managed-by=goauthentik.io --no-headers 2>/dev/null) || \
|
||||
outpost_pods=$($KUBECTL get pods -n "$NAMESPACE" --no-headers 2>/dev/null | grep -i "outpost" || true)
|
||||
|
||||
if [ -z "$outpost_pods" ]; then
|
||||
add_check "authentik-outposts" "warn" "No outpost pods found"
|
||||
return
|
||||
fi
|
||||
|
||||
local total not_running
|
||||
total=$(echo "$outpost_pods" | grep -c "." 2>/dev/null || echo "0")
|
||||
not_running=$(echo "$outpost_pods" | grep -v "Running" | grep -c "." 2>/dev/null || echo "0")
|
||||
|
||||
if [ "$not_running" -gt 0 ]; then
|
||||
add_check "authentik-outposts" "warn" "${not_running}/${total} outpost pod(s) not running"
|
||||
else
|
||||
add_check "authentik-outposts" "ok" "All ${total} outpost pod(s) running"
|
||||
fi
|
||||
}
|
||||
|
||||
check_user_count() {
|
||||
if $DRY_RUN; then
|
||||
add_check "authentik-users" "ok" "dry-run: would check user count via ak CLI"
|
||||
return
|
||||
fi
|
||||
|
||||
local pod
|
||||
pod=$(find_authentik_pod)
|
||||
|
||||
if [ -z "$pod" ]; then
|
||||
add_check "authentik-users" "warn" "No Authentik server pod found to query users"
|
||||
return
|
||||
fi
|
||||
|
||||
# Use the ak CLI to get user count
|
||||
local user_output
|
||||
user_output=$($KUBECTL exec -n "$NAMESPACE" "$pod" -- ak user list 2>/dev/null) || {
|
||||
# Fallback: try management command
|
||||
user_output=$($KUBECTL exec -n "$NAMESPACE" "$pod" -- python -c "
|
||||
import django; django.setup()
|
||||
from authentik.core.models import User
|
||||
print(f'total={User.objects.count()} active={User.objects.filter(is_active=True).count()}')
|
||||
" 2>/dev/null) || {
|
||||
add_check "authentik-users" "warn" "Could not query user count from Authentik"
|
||||
return
|
||||
}
|
||||
}
|
||||
|
||||
local user_count
|
||||
if echo "$user_output" | grep -q "total="; then
|
||||
user_count=$(echo "$user_output" | grep "total=" | sed 's/.*total=\([0-9]*\).*/\1/')
|
||||
local active_count
|
||||
active_count=$(echo "$user_output" | grep "active=" | sed 's/.*active=\([0-9]*\).*/\1/')
|
||||
add_check "authentik-users" "ok" "${user_count} total users, ${active_count} active"
|
||||
else
|
||||
# Count lines of output as fallback
|
||||
user_count=$(echo "$user_output" | wc -l | tr -d ' ')
|
||||
add_check "authentik-users" "ok" "User query returned ${user_count} lines of output"
|
||||
fi
|
||||
}
|
||||
|
||||
check_server_health
|
||||
check_outposts
|
||||
check_user_count
|
||||
|
||||
# Output JSON
|
||||
overall="ok"
|
||||
for c in "${checks[@]}"; do
|
||||
s=$(echo "$c" | jq -r '.status')
|
||||
if [ "$s" = "fail" ]; then overall="fail"; break; fi
|
||||
if [ "$s" = "warn" ]; then overall="warn"; fi
|
||||
done
|
||||
|
||||
printf '{"status": "%s", "agent": "%s", "checks": [%s]}\n' \
|
||||
"$overall" "$AGENT" "$(IFS=,; echo "${checks[*]}")"
|
||||
|
|
@ -1,180 +0,0 @@
|
|||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
# Authentik Invitation Management Script
|
||||
# Usage:
|
||||
# ./authentik-invite.sh create "Group Name" # Single-use, no expiry
|
||||
# ./authentik-invite.sh create "Group Name" --days 7 # Expires in 7 days
|
||||
# ./authentik-invite.sh assign <username> "Group Name" # Add user to group
|
||||
# ./authentik-invite.sh list # Show pending invitations
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
INFRA_DIR="$(cd "$SCRIPT_DIR/../.." && pwd)"
|
||||
|
||||
API="https://authentik.viktorbarzin.me/api/v3"
|
||||
FLOW_SLUG="invitation-enrollment"
|
||||
|
||||
get_token() {
|
||||
grep authentik_api_token "$INFRA_DIR/terraform.tfvars" | cut -d'"' -f2
|
||||
}
|
||||
|
||||
api_get() {
|
||||
curl -sf -H "Authorization: Bearer $(get_token)" "$API/$1"
|
||||
}
|
||||
|
||||
api_post() {
|
||||
curl -sf -X POST \
|
||||
-H "Authorization: Bearer $(get_token)" \
|
||||
-H "Content-Type: application/json" \
|
||||
"$API/$1" -d "$2"
|
||||
}
|
||||
|
||||
api_patch() {
|
||||
curl -sf -X PATCH \
|
||||
-H "Authorization: Bearer $(get_token)" \
|
||||
-H "Content-Type: application/json" \
|
||||
"$API/$1" -d "$2"
|
||||
}
|
||||
|
||||
cmd_create() {
|
||||
local group_name="${1:?Usage: create <group-name> [--days N]}"
|
||||
local days=""
|
||||
|
||||
shift
|
||||
while [[ $# -gt 0 ]]; do
|
||||
case "$1" in
|
||||
--days) days="$2"; shift 2 ;;
|
||||
*) echo "Unknown option: $1"; exit 1 ;;
|
||||
esac
|
||||
done
|
||||
|
||||
# Build invitation payload
|
||||
# Get flow PK
|
||||
local flow_pk
|
||||
flow_pk=$(api_get "flows/instances/$FLOW_SLUG/" | python3 -c "import json,sys; print(json.load(sys.stdin)['pk'])")
|
||||
|
||||
local payload
|
||||
payload=$(python3 -c "
|
||||
import json, sys, re
|
||||
from datetime import datetime, timedelta, timezone
|
||||
|
||||
slug = re.sub(r'[^a-z0-9-]', '-', '$group_name'.lower()).strip('-')
|
||||
data = {
|
||||
'name': 'invite-' + slug + '-' + datetime.now(timezone.utc).strftime('%Y%m%d-%H%M'),
|
||||
'single_use': True,
|
||||
'fixed_data': {'group': '$group_name'},
|
||||
'flow': '$flow_pk'
|
||||
}
|
||||
|
||||
days = '$days'
|
||||
if days:
|
||||
expires = datetime.now(timezone.utc) + timedelta(days=int(days))
|
||||
data['expires'] = expires.isoformat()
|
||||
|
||||
print(json.dumps(data))
|
||||
")
|
||||
|
||||
local result
|
||||
result=$(api_post "stages/invitation/invitations/" "$payload")
|
||||
local token
|
||||
token=$(echo "$result" | python3 -c "import json,sys; print(json.load(sys.stdin)['pk'])")
|
||||
|
||||
echo ""
|
||||
echo "Invitation created for group: $group_name"
|
||||
if [[ -n "$days" ]]; then
|
||||
echo "Expires in: $days days"
|
||||
else
|
||||
echo "Expires: never"
|
||||
fi
|
||||
echo "Single-use: yes"
|
||||
echo ""
|
||||
echo "Share this link:"
|
||||
echo " https://authentik.viktorbarzin.me/if/flow/$FLOW_SLUG/?itoken=$token"
|
||||
echo ""
|
||||
}
|
||||
|
||||
cmd_assign() {
|
||||
local username="${1:?Usage: assign <username> <group-name>}"
|
||||
local group_name="${2:?Usage: assign <username> <group-name>}"
|
||||
|
||||
# Find user PK
|
||||
local user_pk
|
||||
user_pk=$(api_get "core/users/?search=$username" | python3 -c "
|
||||
import json, sys
|
||||
users = json.load(sys.stdin)['results']
|
||||
if not users:
|
||||
print('NOT_FOUND', file=sys.stderr)
|
||||
sys.exit(1)
|
||||
print(users[0]['pk'])
|
||||
")
|
||||
|
||||
# Find group PK and current users
|
||||
local group_data
|
||||
group_data=$(api_get "core/groups/?search=$(python3 -c "import urllib.parse; print(urllib.parse.quote('$group_name'))")" | python3 -c "
|
||||
import json, sys
|
||||
groups = json.load(sys.stdin)['results']
|
||||
matches = [g for g in groups if g['name'] == '$group_name']
|
||||
if not matches:
|
||||
print('NOT_FOUND', file=sys.stderr)
|
||||
sys.exit(1)
|
||||
g = matches[0]
|
||||
users = g.get('users', [])
|
||||
print(json.dumps({'pk': g['pk'], 'users': users}))
|
||||
")
|
||||
|
||||
local group_pk
|
||||
group_pk=$(echo "$group_data" | python3 -c "import json,sys; print(json.load(sys.stdin)['pk'])")
|
||||
|
||||
# Add user to group
|
||||
local updated_users
|
||||
updated_users=$(echo "$group_data" | python3 -c "
|
||||
import json, sys
|
||||
d = json.load(sys.stdin)
|
||||
users = d['users']
|
||||
uid = $user_pk
|
||||
if uid not in users:
|
||||
users.append(uid)
|
||||
print(json.dumps(users))
|
||||
")
|
||||
|
||||
api_patch "core/groups/$group_pk/" "{\"users\": $updated_users}" > /dev/null
|
||||
|
||||
echo "Added $username (pk=$user_pk) to group '$group_name'"
|
||||
}
|
||||
|
||||
cmd_list() {
|
||||
api_get "stages/invitation/invitations/?page_size=50" | python3 -c "
|
||||
import json, sys
|
||||
data = json.load(sys.stdin)
|
||||
if not data['results']:
|
||||
print('No pending invitations.')
|
||||
sys.exit(0)
|
||||
|
||||
print(f\"{'Token (itoken)':<40} {'Name':<50} {'Single-Use':<12} {'Expires':<25} {'Group'}\")
|
||||
print('-' * 160)
|
||||
for inv in data['results']:
|
||||
token = inv['pk']
|
||||
name = inv.get('name', '')
|
||||
single = 'yes' if inv.get('single_use') else 'no'
|
||||
expires = inv.get('expires') or 'never'
|
||||
if expires != 'never':
|
||||
expires = expires[:19]
|
||||
group = inv.get('fixed_data', {}).get('group', '—')
|
||||
print(f'{token:<40} {name:<50} {single:<12} {expires:<25} {group}')
|
||||
print(f\"\\nTotal: {data['pagination']['count']}\")
|
||||
"
|
||||
}
|
||||
|
||||
case "${1:-help}" in
|
||||
create) shift; cmd_create "$@" ;;
|
||||
assign) shift; cmd_assign "$@" ;;
|
||||
list) cmd_list ;;
|
||||
*)
|
||||
echo "Authentik Invitation Manager"
|
||||
echo ""
|
||||
echo "Usage:"
|
||||
echo " $0 create <group-name> [--days N] Create single-use invite link"
|
||||
echo " $0 assign <username> <group-name> Add user to group"
|
||||
echo " $0 list Show pending invitations"
|
||||
;;
|
||||
esac
|
||||
|
|
@ -1,566 +0,0 @@
|
|||
#!/usr/bin/env bash
|
||||
# backup-verify.sh — Full 3-2-1 backup health inspection
|
||||
# Checks: LVM snapshots, weekly backup, PVC file copies, pfsense, NFS mirror,
|
||||
# offsite sync, DB CronJobs, CNPG backups
|
||||
# Usage: backup-verify.sh [--fix] [--dry-run]
|
||||
set -euo pipefail
|
||||
|
||||
KUBECTL="kubectl --kubeconfig /Users/viktorbarzin/code/config"
|
||||
PVE_SSH="ssh -o ConnectTimeout=5 -o BatchMode=yes root@192.168.1.127"
|
||||
DRY_RUN=false
|
||||
FIX=false
|
||||
AGENT="backup-verify"
|
||||
|
||||
for arg in "$@"; do
|
||||
case "$arg" in
|
||||
--dry-run) DRY_RUN=true ;;
|
||||
--fix) FIX=true ;;
|
||||
esac
|
||||
done
|
||||
|
||||
CHECKS="[]"
|
||||
PVE_REACHABLE=true
|
||||
|
||||
add_check() {
|
||||
local name="$1" status="$2" message="$3"
|
||||
CHECKS=$(echo "$CHECKS" | python3 -c "
|
||||
import sys, json
|
||||
checks = json.load(sys.stdin)
|
||||
checks.append({'name': '''$name''', 'status': '''$status''', 'message': '''$message'''})
|
||||
json.dump(checks, sys.stdout)
|
||||
")
|
||||
}
|
||||
|
||||
# Test PVE host connectivity (all Layer 1+2 checks depend on this)
|
||||
check_pve_connectivity() {
|
||||
if $DRY_RUN; then return; fi
|
||||
if ! $PVE_SSH "true" 2>/dev/null; then
|
||||
PVE_REACHABLE=false
|
||||
add_check "pve-connectivity" "fail" "PVE host (192.168.1.127) unreachable via SSH"
|
||||
fi
|
||||
}
|
||||
|
||||
# ============================================================
|
||||
# LAYER 1: LVM Thin Snapshots
|
||||
# ============================================================
|
||||
|
||||
check_lvm_snapshot_freshness() {
|
||||
if $DRY_RUN; then add_check "lvm-snapshot-freshness" "ok" "DRY RUN"; return; fi
|
||||
if ! $PVE_REACHABLE; then add_check "lvm-snapshot-freshness" "fail" "PVE unreachable"; return; fi
|
||||
|
||||
local ts
|
||||
ts=$($PVE_SSH "curl -s http://10.0.20.100:30091/metrics 2>/dev/null | grep '^lvm_snapshot_last_run_timestamp' | head -1 | awk '{print \$2}'" 2>/dev/null) || true
|
||||
|
||||
if [ -z "$ts" ] || [ "$ts" = "" ]; then
|
||||
add_check "lvm-snapshot-freshness" "fail" "No Pushgateway metric found — snapshots may have never run"
|
||||
return
|
||||
fi
|
||||
|
||||
local now age_h
|
||||
now=$(date +%s)
|
||||
age_h=$(python3 -c "print(f'{($now - $ts) / 3600:.1f}')" 2>/dev/null)
|
||||
|
||||
if python3 -c "exit(0 if ($now - $ts) < 129600 else 1)" 2>/dev/null; then # 36h
|
||||
add_check "lvm-snapshot-freshness" "ok" "Last snapshot ${age_h}h ago"
|
||||
elif python3 -c "exit(0 if ($now - $ts) < 172800 else 1)" 2>/dev/null; then # 48h
|
||||
add_check "lvm-snapshot-freshness" "warn" "Snapshot getting stale: ${age_h}h ago (threshold: 36h)"
|
||||
else
|
||||
add_check "lvm-snapshot-freshness" "fail" "Snapshot stale: ${age_h}h ago (threshold: 48h)"
|
||||
fi
|
||||
}
|
||||
|
||||
check_lvm_snapshot_status() {
|
||||
if $DRY_RUN; then add_check "lvm-snapshot-status" "ok" "DRY RUN"; return; fi
|
||||
if ! $PVE_REACHABLE; then add_check "lvm-snapshot-status" "fail" "PVE unreachable"; return; fi
|
||||
|
||||
local status
|
||||
status=$($PVE_SSH "curl -s http://10.0.20.100:30091/metrics 2>/dev/null | grep '^lvm_snapshot_last_status' | head -1 | awk '{print \$2}'" 2>/dev/null) || true
|
||||
|
||||
if [ "$status" = "0" ] || [ "$status" = "0.0" ]; then
|
||||
add_check "lvm-snapshot-status" "ok" "Last snapshot run succeeded"
|
||||
elif [ -z "$status" ]; then
|
||||
add_check "lvm-snapshot-status" "warn" "No status metric found"
|
||||
else
|
||||
add_check "lvm-snapshot-status" "fail" "Last snapshot run failed (status=$status)"
|
||||
fi
|
||||
}
|
||||
|
||||
check_lvm_snapshot_count() {
|
||||
if $DRY_RUN; then add_check "lvm-snapshot-count" "ok" "DRY RUN"; return; fi
|
||||
if ! $PVE_REACHABLE; then add_check "lvm-snapshot-count" "fail" "PVE unreachable"; return; fi
|
||||
|
||||
local count
|
||||
count=$($PVE_SSH "lvs pve 2>/dev/null | grep -c '_snap_' || echo 0" 2>/dev/null) || count=0
|
||||
|
||||
if [ "$count" -ge 50 ]; then
|
||||
add_check "lvm-snapshot-count" "ok" "${count} snapshots exist"
|
||||
elif [ "$count" -gt 0 ]; then
|
||||
add_check "lvm-snapshot-count" "warn" "Only ${count} snapshots (expected ≥50)"
|
||||
else
|
||||
add_check "lvm-snapshot-count" "fail" "No snapshots exist"
|
||||
fi
|
||||
}
|
||||
|
||||
check_lvm_thinpool_free() {
|
||||
if $DRY_RUN; then add_check "lvm-thinpool-free" "ok" "DRY RUN"; return; fi
|
||||
if ! $PVE_REACHABLE; then add_check "lvm-thinpool-free" "fail" "PVE unreachable"; return; fi
|
||||
|
||||
local data_pct free_pct
|
||||
data_pct=$($PVE_SSH "lvs --noheadings --nosuffix -o data_percent pve/data 2>/dev/null | tr -d ' '" 2>/dev/null) || true
|
||||
|
||||
if [ -z "$data_pct" ]; then
|
||||
add_check "lvm-thinpool-free" "warn" "Cannot read thin pool usage"
|
||||
return
|
||||
fi
|
||||
|
||||
free_pct=$(python3 -c "print(f'{100 - $data_pct:.1f}')" 2>/dev/null)
|
||||
|
||||
if python3 -c "exit(0 if (100 - $data_pct) > 15 else 1)" 2>/dev/null; then
|
||||
add_check "lvm-thinpool-free" "ok" "Thin pool ${free_pct}% free"
|
||||
elif python3 -c "exit(0 if (100 - $data_pct) > 10 else 1)" 2>/dev/null; then
|
||||
add_check "lvm-thinpool-free" "warn" "Thin pool low: ${free_pct}% free (threshold: 15%)"
|
||||
else
|
||||
add_check "lvm-thinpool-free" "fail" "Thin pool critical: ${free_pct}% free (threshold: 10%)"
|
||||
fi
|
||||
}
|
||||
|
||||
check_lvm_snapshot_timer() {
|
||||
if $DRY_RUN; then add_check "lvm-snapshot-timer" "ok" "DRY RUN"; return; fi
|
||||
if ! $PVE_REACHABLE; then add_check "lvm-snapshot-timer" "fail" "PVE unreachable"; return; fi
|
||||
|
||||
local active enabled
|
||||
active=$($PVE_SSH "systemctl is-active lvm-pvc-snapshot.timer 2>/dev/null" 2>/dev/null) || active="unknown"
|
||||
enabled=$($PVE_SSH "systemctl is-enabled lvm-pvc-snapshot.timer 2>/dev/null" 2>/dev/null) || enabled="unknown"
|
||||
|
||||
if [ "$active" = "active" ] && [ "$enabled" = "enabled" ]; then
|
||||
add_check "lvm-snapshot-timer" "ok" "Timer active and enabled"
|
||||
else
|
||||
add_check "lvm-snapshot-timer" "fail" "Timer: active=$active enabled=$enabled"
|
||||
if $FIX; then
|
||||
$PVE_SSH "systemctl enable --now lvm-pvc-snapshot.timer" 2>/dev/null && \
|
||||
add_check "lvm-snapshot-timer-fix" "ok" "AUTO-FIX: Timer re-enabled" || \
|
||||
add_check "lvm-snapshot-timer-fix" "fail" "AUTO-FIX: Failed to re-enable timer"
|
||||
fi
|
||||
fi
|
||||
}
|
||||
|
||||
# ============================================================
|
||||
# LAYER 2: Weekly Backup (sda)
|
||||
# ============================================================
|
||||
|
||||
check_daily_backup_freshness() {
|
||||
if $DRY_RUN; then add_check "daily-backup-freshness" "ok" "DRY RUN"; return; fi
|
||||
if ! $PVE_REACHABLE; then add_check "daily-backup-freshness" "fail" "PVE unreachable"; return; fi
|
||||
|
||||
local ts
|
||||
ts=$($PVE_SSH "curl -s http://10.0.20.100:30091/metrics 2>/dev/null | grep '^daily_backup_last_run_timestamp' | head -1 | awk '{print \$2}'" 2>/dev/null) || true
|
||||
|
||||
if [ -z "$ts" ]; then
|
||||
add_check "daily-backup-freshness" "fail" "No weekly backup metric — may have never run"
|
||||
return
|
||||
fi
|
||||
|
||||
local now age_h
|
||||
now=$(date +%s)
|
||||
age_h=$(python3 -c "print(f'{($now - $ts) / 3600:.1f}')" 2>/dev/null)
|
||||
|
||||
if python3 -c "exit(0 if ($now - $ts) < 777600 else 1)" 2>/dev/null; then # 9d
|
||||
add_check "daily-backup-freshness" "ok" "Last run ${age_h}h ago"
|
||||
else
|
||||
add_check "daily-backup-freshness" "fail" "Daily backup stale: ${age_h}h ago (threshold: 9d)"
|
||||
fi
|
||||
}
|
||||
|
||||
check_daily_backup_status() {
|
||||
if $DRY_RUN; then add_check "daily-backup-status" "ok" "DRY RUN"; return; fi
|
||||
if ! $PVE_REACHABLE; then add_check "daily-backup-status" "fail" "PVE unreachable"; return; fi
|
||||
|
||||
local status
|
||||
status=$($PVE_SSH "curl -s http://10.0.20.100:30091/metrics 2>/dev/null | grep '^daily_backup_last_status' | head -1 | awk '{print \$2}'" 2>/dev/null) || true
|
||||
|
||||
if [ "$status" = "0" ] || [ "$status" = "0.0" ]; then
|
||||
add_check "daily-backup-status" "ok" "Last weekly backup succeeded"
|
||||
elif [ -z "$status" ]; then
|
||||
add_check "daily-backup-status" "warn" "No status metric found"
|
||||
else
|
||||
add_check "daily-backup-status" "fail" "Last weekly backup failed (status=$status)"
|
||||
fi
|
||||
}
|
||||
|
||||
check_daily_backup_timer() {
|
||||
if $DRY_RUN; then add_check "daily-backup-timer" "ok" "DRY RUN"; return; fi
|
||||
if ! $PVE_REACHABLE; then add_check "daily-backup-timer" "fail" "PVE unreachable"; return; fi
|
||||
|
||||
local active enabled
|
||||
active=$($PVE_SSH "systemctl is-active daily-backup.timer 2>/dev/null" 2>/dev/null) || active="unknown"
|
||||
enabled=$($PVE_SSH "systemctl is-enabled daily-backup.timer 2>/dev/null" 2>/dev/null) || enabled="unknown"
|
||||
|
||||
if [ "$active" = "active" ] && [ "$enabled" = "enabled" ]; then
|
||||
add_check "daily-backup-timer" "ok" "Timer active and enabled"
|
||||
else
|
||||
add_check "daily-backup-timer" "fail" "Timer: active=$active enabled=$enabled"
|
||||
if $FIX; then
|
||||
$PVE_SSH "systemctl enable --now daily-backup.timer" 2>/dev/null && \
|
||||
add_check "daily-backup-timer-fix" "ok" "AUTO-FIX: Timer re-enabled" || \
|
||||
add_check "daily-backup-timer-fix" "fail" "AUTO-FIX: Failed to re-enable timer"
|
||||
fi
|
||||
fi
|
||||
}
|
||||
|
||||
check_sda_mount() {
|
||||
if $DRY_RUN; then add_check "sda-mount" "ok" "DRY RUN"; return; fi
|
||||
if ! $PVE_REACHABLE; then add_check "sda-mount" "fail" "PVE unreachable"; return; fi
|
||||
|
||||
if $PVE_SSH "mountpoint -q /mnt/backup" 2>/dev/null; then
|
||||
add_check "sda-mount" "ok" "/mnt/backup is mounted"
|
||||
else
|
||||
add_check "sda-mount" "fail" "/mnt/backup is NOT mounted"
|
||||
if $FIX; then
|
||||
$PVE_SSH "mount /mnt/backup" 2>/dev/null && \
|
||||
add_check "sda-mount-fix" "ok" "AUTO-FIX: Mounted /mnt/backup" || \
|
||||
add_check "sda-mount-fix" "fail" "AUTO-FIX: Failed to mount /mnt/backup"
|
||||
fi
|
||||
fi
|
||||
}
|
||||
|
||||
check_sda_disk_usage() {
|
||||
if $DRY_RUN; then add_check "sda-disk-usage" "ok" "DRY RUN"; return; fi
|
||||
if ! $PVE_REACHABLE; then add_check "sda-disk-usage" "fail" "PVE unreachable"; return; fi
|
||||
|
||||
local usage_pct
|
||||
usage_pct=$($PVE_SSH "df --output=pcent /mnt/backup 2>/dev/null | tail -1 | tr -d ' %'" 2>/dev/null) || true
|
||||
|
||||
if [ -z "$usage_pct" ]; then
|
||||
add_check "sda-disk-usage" "warn" "Cannot read /mnt/backup usage"
|
||||
return
|
||||
fi
|
||||
|
||||
if [ "$usage_pct" -lt 85 ]; then
|
||||
add_check "sda-disk-usage" "ok" "Backup disk ${usage_pct}% used"
|
||||
elif [ "$usage_pct" -lt 95 ]; then
|
||||
add_check "sda-disk-usage" "warn" "Backup disk ${usage_pct}% used (threshold: 85%)"
|
||||
else
|
||||
add_check "sda-disk-usage" "fail" "Backup disk ${usage_pct}% used (threshold: 95%)"
|
||||
fi
|
||||
}
|
||||
|
||||
check_pvc_data_freshness() {
|
||||
if $DRY_RUN; then add_check "pvc-data-freshness" "ok" "DRY RUN"; return; fi
|
||||
if ! $PVE_REACHABLE; then add_check "pvc-data-freshness" "fail" "PVE unreachable"; return; fi
|
||||
|
||||
local latest_week count
|
||||
latest_week=$($PVE_SSH "ls -1d /mnt/backup/pvc-data/????-?? 2>/dev/null | tail -1" 2>/dev/null) || true
|
||||
count=$($PVE_SSH "ls -1d /mnt/backup/pvc-data/????-??/*/* 2>/dev/null | wc -l" 2>/dev/null) || count=0
|
||||
|
||||
if [ -z "$latest_week" ]; then
|
||||
add_check "pvc-data-freshness" "fail" "No PVC file copies found on sda"
|
||||
else
|
||||
local week_name age_days
|
||||
week_name=$(basename "$latest_week")
|
||||
# Check age of latest week dir
|
||||
age_days=$($PVE_SSH "echo \$(( (\$(date +%s) - \$(stat -c %Y '$latest_week')) / 86400 ))" 2>/dev/null) || age_days=999
|
||||
if [ "$age_days" -lt 9 ]; then
|
||||
add_check "pvc-data-freshness" "ok" "PVC copies: week ${week_name}, ${count} PVCs, ${age_days}d old"
|
||||
else
|
||||
add_check "pvc-data-freshness" "fail" "PVC copies stale: week ${week_name}, ${age_days}d old (threshold: 9d)"
|
||||
fi
|
||||
fi
|
||||
}
|
||||
|
||||
check_nfs_mirror_freshness() {
|
||||
if $DRY_RUN; then add_check "nfs-mirror-freshness" "ok" "DRY RUN"; return; fi
|
||||
if ! $PVE_REACHABLE; then add_check "nfs-mirror-freshness" "fail" "PVE unreachable"; return; fi
|
||||
|
||||
local dir_count age_days
|
||||
dir_count=$($PVE_SSH "ls -1d /mnt/backup/nfs-mirror/*-backup 2>/dev/null | wc -l" 2>/dev/null) || dir_count=0
|
||||
age_days=$($PVE_SSH "echo \$(( (\$(date +%s) - \$(stat -c %Y /mnt/backup/nfs-mirror 2>/dev/null || echo 0)) / 86400 ))" 2>/dev/null) || age_days=999
|
||||
|
||||
if [ "$dir_count" -gt 0 ] && [ "$age_days" -lt 9 ]; then
|
||||
add_check "nfs-mirror-freshness" "ok" "NFS mirror: ${dir_count} dirs, ${age_days}d old"
|
||||
elif [ "$dir_count" -eq 0 ]; then
|
||||
add_check "nfs-mirror-freshness" "fail" "No NFS mirror dirs found on sda"
|
||||
else
|
||||
add_check "nfs-mirror-freshness" "fail" "NFS mirror stale: ${age_days}d old (threshold: 9d)"
|
||||
fi
|
||||
}
|
||||
|
||||
check_pfsense_backup_freshness() {
|
||||
if $DRY_RUN; then add_check "pfsense-backup-freshness" "ok" "DRY RUN"; return; fi
|
||||
if ! $PVE_REACHABLE; then add_check "pfsense-backup-freshness" "fail" "PVE unreachable"; return; fi
|
||||
|
||||
local latest age_days
|
||||
latest=$($PVE_SSH "ls -t /mnt/backup/pfsense/config-*.xml 2>/dev/null | head -1" 2>/dev/null) || true
|
||||
|
||||
if [ -z "$latest" ]; then
|
||||
add_check "pfsense-backup-freshness" "fail" "No pfsense config.xml backups found"
|
||||
return
|
||||
fi
|
||||
|
||||
age_days=$($PVE_SSH "echo \$(( (\$(date +%s) - \$(stat -c %Y '$latest')) / 86400 ))" 2>/dev/null) || age_days=999
|
||||
local fname
|
||||
fname=$(basename "$latest")
|
||||
|
||||
if [ "$age_days" -lt 9 ]; then
|
||||
add_check "pfsense-backup-freshness" "ok" "pfsense backup: ${fname}, ${age_days}d old"
|
||||
else
|
||||
add_check "pfsense-backup-freshness" "fail" "pfsense backup stale: ${fname}, ${age_days}d old (threshold: 9d)"
|
||||
fi
|
||||
}
|
||||
|
||||
# ============================================================
|
||||
# LAYER 3: Offsite Sync
|
||||
# ============================================================
|
||||
|
||||
check_offsite_sync_freshness() {
|
||||
if $DRY_RUN; then add_check "offsite-sync-freshness" "ok" "DRY RUN"; return; fi
|
||||
if ! $PVE_REACHABLE; then add_check "offsite-sync-freshness" "fail" "PVE unreachable"; return; fi
|
||||
|
||||
local ts
|
||||
ts=$($PVE_SSH "curl -s http://10.0.20.100:30091/metrics 2>/dev/null | grep 'backup_last_success_timestamp.*offsite-backup-sync' | awk '{print \$NF}'" 2>/dev/null) || true
|
||||
|
||||
if [ -z "$ts" ]; then
|
||||
add_check "offsite-sync-freshness" "fail" "No offsite sync metric — may have never run"
|
||||
return
|
||||
fi
|
||||
|
||||
local now age_h
|
||||
now=$(date +%s)
|
||||
age_h=$(python3 -c "print(f'{($now - $ts) / 3600:.1f}')" 2>/dev/null)
|
||||
|
||||
if python3 -c "exit(0 if ($now - $ts) < 777600 else 1)" 2>/dev/null; then # 9d
|
||||
add_check "offsite-sync-freshness" "ok" "Last offsite sync ${age_h}h ago"
|
||||
else
|
||||
add_check "offsite-sync-freshness" "fail" "Offsite sync stale: ${age_h}h ago (threshold: 9d)"
|
||||
fi
|
||||
}
|
||||
|
||||
check_offsite_sync_status() {
|
||||
if $DRY_RUN; then add_check "offsite-sync-status" "ok" "DRY RUN"; return; fi
|
||||
if ! $PVE_REACHABLE; then add_check "offsite-sync-status" "fail" "PVE unreachable"; return; fi
|
||||
|
||||
local status
|
||||
status=$($PVE_SSH "curl -s http://10.0.20.100:30091/metrics 2>/dev/null | grep '^offsite_sync_last_status' | head -1 | awk '{print \$2}'" 2>/dev/null) || true
|
||||
|
||||
if [ "$status" = "0" ] || [ "$status" = "0.0" ]; then
|
||||
add_check "offsite-sync-status" "ok" "Last offsite sync succeeded"
|
||||
elif [ -z "$status" ]; then
|
||||
add_check "offsite-sync-status" "warn" "No offsite sync status metric"
|
||||
else
|
||||
add_check "offsite-sync-status" "fail" "Last offsite sync failed (status=$status)"
|
||||
fi
|
||||
}
|
||||
|
||||
check_offsite_sync_timer() {
|
||||
if $DRY_RUN; then add_check "offsite-sync-timer" "ok" "DRY RUN"; return; fi
|
||||
if ! $PVE_REACHABLE; then add_check "offsite-sync-timer" "fail" "PVE unreachable"; return; fi
|
||||
|
||||
local active enabled
|
||||
active=$($PVE_SSH "systemctl is-active offsite-sync-backup.timer 2>/dev/null" 2>/dev/null) || active="unknown"
|
||||
enabled=$($PVE_SSH "systemctl is-enabled offsite-sync-backup.timer 2>/dev/null" 2>/dev/null) || enabled="unknown"
|
||||
|
||||
if [ "$active" = "active" ] && [ "$enabled" = "enabled" ]; then
|
||||
add_check "offsite-sync-timer" "ok" "Timer active and enabled"
|
||||
else
|
||||
add_check "offsite-sync-timer" "fail" "Timer: active=$active enabled=$enabled"
|
||||
if $FIX; then
|
||||
$PVE_SSH "systemctl enable --now offsite-sync-backup.timer" 2>/dev/null && \
|
||||
add_check "offsite-sync-timer-fix" "ok" "AUTO-FIX: Timer re-enabled" || \
|
||||
add_check "offsite-sync-timer-fix" "fail" "AUTO-FIX: Failed to re-enable timer"
|
||||
fi
|
||||
fi
|
||||
}
|
||||
|
||||
# ============================================================
|
||||
# DB BACKUP CRONJOBS
|
||||
# ============================================================
|
||||
|
||||
check_backup_cronjobs() {
|
||||
if $DRY_RUN; then add_check "backup-cronjobs" "ok" "DRY RUN"; return; fi
|
||||
|
||||
local report
|
||||
report=$($KUBECTL get cronjobs --all-namespaces -o json 2>/dev/null | python3 -c "
|
||||
import sys, json
|
||||
from datetime import datetime, timezone
|
||||
|
||||
data = json.load(sys.stdin)
|
||||
# CronJobs with backup-related names
|
||||
backup_cjs = []
|
||||
for cj in data.get('items', []):
|
||||
name = cj['metadata']['name']
|
||||
ns = cj['metadata']['namespace']
|
||||
if any(k in name.lower() for k in ['backup', 'etcd', 'raft']):
|
||||
backup_cjs.append(cj)
|
||||
|
||||
if not backup_cjs:
|
||||
print('WARN|No backup CronJobs found')
|
||||
sys.exit(0)
|
||||
|
||||
# Thresholds in hours
|
||||
thresholds = {
|
||||
'mysql': 36, 'postgresql': 36, 'immich': 36,
|
||||
'vault': 216, 'etcd': 216, 'redis': 216,
|
||||
'vaultwarden': 216, 'plotting': 216, 'headscale': 216,
|
||||
'prometheus': 840, # 35 days
|
||||
}
|
||||
|
||||
results = []
|
||||
all_ok = True
|
||||
now = datetime.now(timezone.utc)
|
||||
for cj in backup_cjs:
|
||||
ns = cj['metadata']['namespace']
|
||||
name = cj['metadata']['name']
|
||||
last_success = cj.get('status', {}).get('lastSuccessfulTime', '')
|
||||
suspend = cj.get('spec', {}).get('suspend', False)
|
||||
|
||||
# Find matching threshold
|
||||
threshold_h = 216 # default 9 days
|
||||
for key, th in thresholds.items():
|
||||
if key in name.lower():
|
||||
threshold_h = th
|
||||
break
|
||||
|
||||
if suspend:
|
||||
all_ok = False
|
||||
results.append(f'FAIL {ns}/{name}: SUSPENDED')
|
||||
continue
|
||||
|
||||
if not last_success:
|
||||
results.append(f'WARN {ns}/{name}: never succeeded')
|
||||
all_ok = False
|
||||
continue
|
||||
|
||||
try:
|
||||
dt = datetime.fromisoformat(last_success.replace('Z', '+00:00'))
|
||||
age_h = (now - dt).total_seconds() / 3600
|
||||
if age_h > threshold_h:
|
||||
all_ok = False
|
||||
results.append(f'FAIL {ns}/{name}: {age_h:.0f}h ago (threshold: {threshold_h}h)')
|
||||
else:
|
||||
results.append(f'OK {ns}/{name}: {age_h:.0f}h ago')
|
||||
except Exception:
|
||||
results.append(f'WARN {ns}/{name}: cannot parse time {last_success}')
|
||||
all_ok = False
|
||||
|
||||
status = 'OK' if all_ok else 'WARN'
|
||||
print(f'{status}|' + '; '.join(results))
|
||||
" 2>/dev/null) || report="WARN|Failed to check backup CronJobs"
|
||||
|
||||
local status_prefix="${report%%|*}"
|
||||
local detail="${report#*|}"
|
||||
|
||||
if [ "$status_prefix" = "OK" ]; then
|
||||
add_check "backup-cronjobs" "ok" "$detail"
|
||||
else
|
||||
add_check "backup-cronjobs" "warn" "$detail"
|
||||
fi
|
||||
}
|
||||
|
||||
# ============================================================
|
||||
# CNPG BACKUPS (existing checks, kept as-is)
|
||||
# ============================================================
|
||||
|
||||
check_cnpg_backups() {
|
||||
if $DRY_RUN; then add_check "cnpg-backups" "ok" "DRY RUN"; return; fi
|
||||
|
||||
local backups
|
||||
backups=$($KUBECTL get backup.postgresql.cnpg.io --all-namespaces -o json 2>/dev/null) || {
|
||||
add_check "cnpg-backups" "warn" "No CNPG Backup CRDs found"
|
||||
return
|
||||
}
|
||||
|
||||
local report
|
||||
report=$(echo "$backups" | python3 -c "
|
||||
import sys, json
|
||||
from datetime import datetime, timezone
|
||||
|
||||
data = json.load(sys.stdin)
|
||||
items = data.get('items', [])
|
||||
if not items:
|
||||
print('WARN|No CNPG backups found')
|
||||
sys.exit(0)
|
||||
|
||||
clusters = {}
|
||||
for b in items:
|
||||
ns = b['metadata']['namespace']
|
||||
cluster = b.get('spec', {}).get('cluster', {}).get('name', 'unknown')
|
||||
key = f'{ns}/{cluster}'
|
||||
stopped = b.get('status', {}).get('stoppedAt', '')
|
||||
phase = b.get('status', {}).get('phase', 'unknown')
|
||||
if key not in clusters or stopped > clusters[key].get('stopped', ''):
|
||||
clusters[key] = {'phase': phase, 'stopped': stopped}
|
||||
|
||||
results = []
|
||||
all_ok = True
|
||||
now = datetime.now(timezone.utc)
|
||||
for key, info in sorted(clusters.items()):
|
||||
if info['stopped']:
|
||||
try:
|
||||
dt = datetime.fromisoformat(info['stopped'].replace('Z', '+00:00'))
|
||||
age_h = (now - dt).total_seconds() / 3600
|
||||
if age_h > 48: all_ok = False
|
||||
results.append(f'{key}: {info[\"phase\"]} ({age_h:.1f}h ago)')
|
||||
except: results.append(f'{key}: {info[\"phase\"]}'); all_ok = False
|
||||
else:
|
||||
results.append(f'{key}: {info[\"phase\"]} (no completion)'); all_ok = False
|
||||
|
||||
print(f'{\"OK\" if all_ok else \"WARN\"}|' + '; '.join(results))
|
||||
" 2>/dev/null) || report="WARN|Failed to parse CNPG backups"
|
||||
|
||||
local status_prefix="${report%%|*}"
|
||||
local detail="${report#*|}"
|
||||
if [ "$status_prefix" = "OK" ]; then
|
||||
add_check "cnpg-backups" "ok" "$detail"
|
||||
else
|
||||
add_check "cnpg-backups" "warn" "$detail"
|
||||
fi
|
||||
}
|
||||
|
||||
# ============================================================
|
||||
# RUN ALL CHECKS
|
||||
# ============================================================
|
||||
|
||||
check_pve_connectivity
|
||||
|
||||
# Layer 1: LVM Thin Snapshots
|
||||
check_lvm_snapshot_freshness
|
||||
check_lvm_snapshot_status
|
||||
check_lvm_snapshot_count
|
||||
check_lvm_thinpool_free
|
||||
check_lvm_snapshot_timer
|
||||
|
||||
# Layer 2: Weekly Backup (sda)
|
||||
check_daily_backup_freshness
|
||||
check_daily_backup_status
|
||||
check_daily_backup_timer
|
||||
check_sda_mount
|
||||
check_sda_disk_usage
|
||||
check_pvc_data_freshness
|
||||
check_nfs_mirror_freshness
|
||||
check_pfsense_backup_freshness
|
||||
|
||||
# Layer 3: Offsite Sync
|
||||
check_offsite_sync_freshness
|
||||
check_offsite_sync_status
|
||||
check_offsite_sync_timer
|
||||
|
||||
# DB CronJobs + CNPG
|
||||
check_backup_cronjobs
|
||||
check_cnpg_backups
|
||||
|
||||
# ============================================================
|
||||
# OUTPUT
|
||||
# ============================================================
|
||||
|
||||
OVERALL=$(echo "$CHECKS" | python3 -c "
|
||||
import sys, json
|
||||
checks = json.load(sys.stdin)
|
||||
statuses = [c['status'] for c in checks]
|
||||
if 'fail' in statuses:
|
||||
print('fail')
|
||||
elif 'warn' in statuses:
|
||||
print('warn')
|
||||
else:
|
||||
print('ok')
|
||||
")
|
||||
|
||||
echo "{\"status\": \"$OVERALL\", \"agent\": \"$AGENT\", \"checks\": $CHECKS}" | python3 -m json.tool
|
||||
|
|
@ -1,166 +0,0 @@
|
|||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
KUBECTL="kubectl --kubeconfig /Users/viktorbarzin/code/infra/config"
|
||||
AGENT="crowdsec-status"
|
||||
DRY_RUN=false
|
||||
|
||||
for arg in "$@"; do
|
||||
case "$arg" in
|
||||
--dry-run) DRY_RUN=true ;;
|
||||
esac
|
||||
done
|
||||
|
||||
checks=()
|
||||
|
||||
add_check() {
|
||||
local name="$1" status="$2" message="$3"
|
||||
checks+=("{\"name\": \"$name\", \"status\": \"$status\", \"message\": \"$message\"}")
|
||||
}
|
||||
|
||||
find_crowdsec_namespace() {
|
||||
$KUBECTL get pods -A -l app.kubernetes.io/name=crowdsec --no-headers 2>/dev/null | head -1 | awk '{print $1}' || \
|
||||
$KUBECTL get pods -A --no-headers 2>/dev/null | grep -i crowdsec | head -1 | awk '{print $1}' || \
|
||||
echo "crowdsec"
|
||||
}
|
||||
|
||||
check_lapi_health() {
|
||||
if $DRY_RUN; then
|
||||
add_check "crowdsec-lapi" "ok" "dry-run: would check CrowdSec LAPI pod health"
|
||||
return
|
||||
fi
|
||||
|
||||
local ns
|
||||
ns=$(find_crowdsec_namespace)
|
||||
|
||||
local lapi_pod
|
||||
lapi_pod=$($KUBECTL get pods -n "$ns" -l app.kubernetes.io/name=crowdsec,app.kubernetes.io/component=lapi --no-headers 2>/dev/null | head -1) || true
|
||||
|
||||
if [ -z "$lapi_pod" ]; then
|
||||
lapi_pod=$($KUBECTL get pods -n "$ns" --no-headers 2>/dev/null | grep -i "crowdsec.*lapi" | head -1) || true
|
||||
fi
|
||||
|
||||
if [ -z "$lapi_pod" ]; then
|
||||
add_check "crowdsec-lapi" "fail" "No CrowdSec LAPI pod found in namespace ${ns}"
|
||||
return
|
||||
fi
|
||||
|
||||
local pod_name status
|
||||
pod_name=$(echo "$lapi_pod" | awk '{print $1}')
|
||||
status=$(echo "$lapi_pod" | awk '{print $3}')
|
||||
|
||||
if [ "$status" != "Running" ]; then
|
||||
add_check "crowdsec-lapi" "fail" "LAPI pod ${pod_name} is ${status}"
|
||||
return
|
||||
fi
|
||||
|
||||
add_check "crowdsec-lapi" "ok" "LAPI pod ${pod_name} is Running"
|
||||
}
|
||||
|
||||
check_cscli_metrics() {
|
||||
if $DRY_RUN; then
|
||||
add_check "crowdsec-metrics" "ok" "dry-run: would run cscli metrics via kubectl exec"
|
||||
return
|
||||
fi
|
||||
|
||||
local ns
|
||||
ns=$(find_crowdsec_namespace)
|
||||
|
||||
local lapi_pod
|
||||
lapi_pod=$($KUBECTL get pods -n "$ns" -l app.kubernetes.io/name=crowdsec,app.kubernetes.io/component=lapi -o jsonpath='{.items[0].metadata.name}' 2>/dev/null) || \
|
||||
lapi_pod=$($KUBECTL get pods -n "$ns" --no-headers 2>/dev/null | grep -i "crowdsec.*lapi" | head -1 | awk '{print $1}') || true
|
||||
|
||||
if [ -z "$lapi_pod" ]; then
|
||||
add_check "crowdsec-metrics" "warn" "No LAPI pod found to run cscli metrics"
|
||||
return
|
||||
fi
|
||||
|
||||
local metrics_output
|
||||
metrics_output=$($KUBECTL exec -n "$ns" "$lapi_pod" -- cscli metrics 2>/dev/null) || {
|
||||
add_check "crowdsec-metrics" "warn" "Failed to run cscli metrics on ${lapi_pod}"
|
||||
return
|
||||
}
|
||||
|
||||
add_check "crowdsec-metrics" "ok" "cscli metrics returned successfully"
|
||||
}
|
||||
|
||||
check_decisions() {
|
||||
if $DRY_RUN; then
|
||||
add_check "crowdsec-decisions" "ok" "dry-run: would check cscli decisions list"
|
||||
return
|
||||
fi
|
||||
|
||||
local ns
|
||||
ns=$(find_crowdsec_namespace)
|
||||
|
||||
local lapi_pod
|
||||
lapi_pod=$($KUBECTL get pods -n "$ns" -l app.kubernetes.io/name=crowdsec,app.kubernetes.io/component=lapi -o jsonpath='{.items[0].metadata.name}' 2>/dev/null) || \
|
||||
lapi_pod=$($KUBECTL get pods -n "$ns" --no-headers 2>/dev/null | grep -i "crowdsec.*lapi" | head -1 | awk '{print $1}') || true
|
||||
|
||||
if [ -z "$lapi_pod" ]; then
|
||||
add_check "crowdsec-decisions" "warn" "No LAPI pod found to check decisions"
|
||||
return
|
||||
fi
|
||||
|
||||
local decisions
|
||||
decisions=$($KUBECTL exec -n "$ns" "$lapi_pod" -- cscli decisions list -o json 2>/dev/null) || {
|
||||
add_check "crowdsec-decisions" "ok" "No active decisions (or failed to query)"
|
||||
return
|
||||
}
|
||||
|
||||
local count
|
||||
count=$(echo "$decisions" | jq 'if type == "array" then length else 0 end' 2>/dev/null || echo "0")
|
||||
|
||||
if [ "$count" -gt 0 ]; then
|
||||
add_check "crowdsec-decisions" "ok" "${count} active decision(s)"
|
||||
else
|
||||
add_check "crowdsec-decisions" "ok" "No active decisions"
|
||||
fi
|
||||
}
|
||||
|
||||
check_agent_daemonset() {
|
||||
if $DRY_RUN; then
|
||||
add_check "crowdsec-agents" "ok" "dry-run: would check CrowdSec agent DaemonSet"
|
||||
return
|
||||
fi
|
||||
|
||||
local ns
|
||||
ns=$(find_crowdsec_namespace)
|
||||
|
||||
local ds_json
|
||||
ds_json=$($KUBECTL get daemonset -n "$ns" -l app.kubernetes.io/name=crowdsec -o json 2>/dev/null) || {
|
||||
# Fallback: search by name
|
||||
ds_json=$($KUBECTL get daemonset -n "$ns" -o json 2>/dev/null | jq '{items: [.items[] | select(.metadata.name | test("crowdsec"))]}') || {
|
||||
add_check "crowdsec-agents" "warn" "No CrowdSec DaemonSet found"
|
||||
return
|
||||
}
|
||||
}
|
||||
|
||||
local desired ready
|
||||
desired=$(echo "$ds_json" | jq '[.items[].status.desiredNumberScheduled] | add // 0' 2>/dev/null || echo "0")
|
||||
ready=$(echo "$ds_json" | jq '[.items[].status.numberReady] | add // 0' 2>/dev/null || echo "0")
|
||||
|
||||
if [ "$ready" -lt "$desired" ]; then
|
||||
add_check "crowdsec-agents" "warn" "CrowdSec agents: ${ready}/${desired} ready"
|
||||
elif [ "$desired" -eq 0 ]; then
|
||||
add_check "crowdsec-agents" "warn" "No CrowdSec agent DaemonSet pods scheduled"
|
||||
else
|
||||
add_check "crowdsec-agents" "ok" "CrowdSec agents: ${ready}/${desired} ready"
|
||||
fi
|
||||
}
|
||||
|
||||
check_lapi_health
|
||||
check_cscli_metrics
|
||||
check_decisions
|
||||
check_agent_daemonset
|
||||
|
||||
# Output JSON
|
||||
overall="ok"
|
||||
for c in "${checks[@]}"; do
|
||||
s=$(echo "$c" | jq -r '.status')
|
||||
if [ "$s" = "fail" ]; then overall="fail"; break; fi
|
||||
if [ "$s" = "warn" ]; then overall="warn"; fi
|
||||
done
|
||||
|
||||
printf '{"status": "%s", "agent": "%s", "checks": [%s]}\n' \
|
||||
"$overall" "$AGENT" "$(IFS=,; echo "${checks[*]}")"
|
||||
|
|
@ -1,194 +0,0 @@
|
|||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
KUBECTL="kubectl --kubeconfig /Users/viktorbarzin/code/infra/config"
|
||||
DRY_RUN=false
|
||||
AGENT="db-health"
|
||||
|
||||
for arg in "$@"; do
|
||||
case "$arg" in
|
||||
--dry-run) DRY_RUN=true ;;
|
||||
esac
|
||||
done
|
||||
|
||||
CHECKS="[]"
|
||||
|
||||
add_check() {
|
||||
local name="$1" status="$2" message="$3"
|
||||
CHECKS=$(echo "$CHECKS" | python3 -c "
|
||||
import sys, json
|
||||
checks = json.load(sys.stdin)
|
||||
checks.append({'name': '''$name''', 'status': '''$status''', 'message': '''$message'''})
|
||||
json.dump(checks, sys.stdout)
|
||||
")
|
||||
}
|
||||
|
||||
# MySQL InnoDB Cluster - Group Replication status
|
||||
check_mysql_gr() {
|
||||
if $DRY_RUN; then
|
||||
add_check "mysql-group-replication" "ok" "DRY RUN: would check MySQL Group Replication status"
|
||||
return
|
||||
fi
|
||||
|
||||
# Discover MySQL pod via labels first, fall back to known name
|
||||
local mysql_pod
|
||||
mysql_pod=$($KUBECTL get pods -n dbaas -l app=mysql-cluster -o name 2>/dev/null | head -1) || true
|
||||
if [ -z "$mysql_pod" ]; then
|
||||
mysql_pod=$($KUBECTL get pods -n dbaas -l app.kubernetes.io/name=mysql -o name 2>/dev/null | head -1) || true
|
||||
fi
|
||||
if [ -z "$mysql_pod" ]; then
|
||||
mysql_pod="sts/mysql-cluster"
|
||||
fi
|
||||
|
||||
local gr_status
|
||||
gr_status=$($KUBECTL exec "$mysql_pod" -n dbaas -- mysql -N -e \
|
||||
"SELECT MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE FROM performance_schema.replication_group_members" 2>/dev/null) || {
|
||||
add_check "mysql-group-replication" "fail" "Cannot connect to MySQL cluster to check GR status"
|
||||
return
|
||||
}
|
||||
|
||||
local member_count online_count
|
||||
member_count=$(echo "$gr_status" | grep -c . || true)
|
||||
online_count=$(echo "$gr_status" | grep -c "ONLINE" || true)
|
||||
|
||||
if [ "$online_count" -eq "$member_count" ] && [ "$member_count" -ge 3 ]; then
|
||||
add_check "mysql-group-replication" "ok" "All $member_count members ONLINE: $(echo "$gr_status" | tr '\t' ' ' | tr '\n' '; ')"
|
||||
elif [ "$online_count" -lt "$member_count" ]; then
|
||||
add_check "mysql-group-replication" "fail" "Only $online_count/$member_count members ONLINE: $(echo "$gr_status" | tr '\t' ' ' | tr '\n' '; ')"
|
||||
else
|
||||
add_check "mysql-group-replication" "warn" "Cluster has $member_count members (expected 3): $(echo "$gr_status" | tr '\t' ' ' | tr '\n' '; ')"
|
||||
fi
|
||||
}
|
||||
|
||||
# MySQL pod health
|
||||
check_mysql_pods() {
|
||||
if $DRY_RUN; then
|
||||
add_check "mysql-pods" "ok" "DRY RUN: would check MySQL pod status"
|
||||
return
|
||||
fi
|
||||
|
||||
local pod_status
|
||||
pod_status=$($KUBECTL get pods -n dbaas -l app=mysql-cluster -o wide --no-headers 2>/dev/null) || \
|
||||
pod_status=$($KUBECTL get pods -n dbaas --no-headers 2>/dev/null | grep -i mysql) || {
|
||||
add_check "mysql-pods" "warn" "Cannot find MySQL pods in dbaas namespace"
|
||||
return
|
||||
}
|
||||
|
||||
local not_running
|
||||
not_running=$(echo "$pod_status" | grep -v "Running" | grep -v "Completed" || true)
|
||||
|
||||
if [ -z "$not_running" ]; then
|
||||
local count
|
||||
count=$(echo "$pod_status" | grep -c "Running" || true)
|
||||
add_check "mysql-pods" "ok" "$count MySQL pod(s) running in dbaas namespace"
|
||||
else
|
||||
add_check "mysql-pods" "fail" "Unhealthy MySQL pods: $(echo "$not_running" | awk '{print $1": "$3}' | tr '\n' '; ')"
|
||||
fi
|
||||
}
|
||||
|
||||
# CNPG PostgreSQL cluster health
|
||||
check_cnpg() {
|
||||
if $DRY_RUN; then
|
||||
add_check "cnpg-clusters" "ok" "DRY RUN: would check CNPG PostgreSQL cluster health"
|
||||
return
|
||||
fi
|
||||
|
||||
# Check if CNPG CRDs exist
|
||||
local cnpg_clusters
|
||||
cnpg_clusters=$($KUBECTL get cluster.postgresql.cnpg.io --all-namespaces -o json 2>/dev/null) || {
|
||||
add_check "cnpg-clusters" "warn" "CNPG CRD not found or no clusters deployed"
|
||||
return
|
||||
}
|
||||
|
||||
local report
|
||||
report=$(echo "$cnpg_clusters" | python3 -c "
|
||||
import sys, json
|
||||
data = json.load(sys.stdin)
|
||||
results = []
|
||||
all_healthy = True
|
||||
for cluster in data.get('items', []):
|
||||
ns = cluster['metadata']['namespace']
|
||||
name = cluster['metadata']['name']
|
||||
phase = cluster.get('status', {}).get('phase', 'unknown')
|
||||
ready = cluster.get('status', {}).get('readyInstances', 0)
|
||||
instances = cluster.get('spec', {}).get('instances', 0)
|
||||
primary = cluster.get('status', {}).get('currentPrimary', 'unknown')
|
||||
if phase != 'Cluster in healthy state' and phase != 'Healthy':
|
||||
all_healthy = False
|
||||
if ready < instances:
|
||||
all_healthy = False
|
||||
results.append(f'{ns}/{name}: phase={phase} ready={ready}/{instances} primary={primary}')
|
||||
print('HEALTHY' if all_healthy else 'UNHEALTHY')
|
||||
print('; '.join(results))
|
||||
" 2>/dev/null) || report="Failed to parse CNPG status"
|
||||
|
||||
local health_line
|
||||
health_line=$(echo "$report" | head -1)
|
||||
local detail_line
|
||||
detail_line=$(echo "$report" | tail -1)
|
||||
|
||||
if [ "$health_line" = "HEALTHY" ]; then
|
||||
add_check "cnpg-clusters" "ok" "$detail_line"
|
||||
else
|
||||
add_check "cnpg-clusters" "fail" "$detail_line"
|
||||
fi
|
||||
}
|
||||
|
||||
# Database connection counts (MySQL)
|
||||
check_mysql_connections() {
|
||||
if $DRY_RUN; then
|
||||
add_check "mysql-connections" "ok" "DRY RUN: would check MySQL connection counts"
|
||||
return
|
||||
fi
|
||||
|
||||
local mysql_pod
|
||||
mysql_pod=$($KUBECTL get pods -n dbaas -l app=mysql-cluster -o name 2>/dev/null | head -1) || true
|
||||
if [ -z "$mysql_pod" ]; then
|
||||
mysql_pod="sts/mysql-cluster"
|
||||
fi
|
||||
|
||||
local conn_info
|
||||
conn_info=$($KUBECTL exec "$mysql_pod" -n dbaas -- mysql -N -e \
|
||||
"SELECT 'threads_connected', VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME='Threads_connected' UNION ALL SELECT 'max_connections', VARIABLE_VALUE FROM performance_schema.global_variables WHERE VARIABLE_NAME='max_connections'" 2>/dev/null) || {
|
||||
add_check "mysql-connections" "warn" "Cannot query MySQL connection info"
|
||||
return
|
||||
}
|
||||
|
||||
local threads_connected max_connections
|
||||
threads_connected=$(echo "$conn_info" | grep threads_connected | awk '{print $2}') || threads_connected="unknown"
|
||||
max_connections=$(echo "$conn_info" | grep max_connections | awk '{print $2}') || max_connections="unknown"
|
||||
|
||||
if [ "$threads_connected" != "unknown" ] && [ "$max_connections" != "unknown" ]; then
|
||||
local pct=$((threads_connected * 100 / max_connections))
|
||||
if [ "$pct" -gt 80 ]; then
|
||||
add_check "mysql-connections" "fail" "MySQL connections at ${pct}%: $threads_connected/$max_connections"
|
||||
elif [ "$pct" -gt 60 ]; then
|
||||
add_check "mysql-connections" "warn" "MySQL connections at ${pct}%: $threads_connected/$max_connections"
|
||||
else
|
||||
add_check "mysql-connections" "ok" "MySQL connections: $threads_connected/$max_connections (${pct}%)"
|
||||
fi
|
||||
else
|
||||
add_check "mysql-connections" "warn" "MySQL connections: threads=$threads_connected max=$max_connections"
|
||||
fi
|
||||
}
|
||||
|
||||
# Run all checks
|
||||
check_mysql_gr
|
||||
check_mysql_pods
|
||||
check_cnpg
|
||||
check_mysql_connections
|
||||
|
||||
# Determine overall status
|
||||
OVERALL=$(echo "$CHECKS" | python3 -c "
|
||||
import sys, json
|
||||
checks = json.load(sys.stdin)
|
||||
statuses = [c['status'] for c in checks]
|
||||
if 'fail' in statuses:
|
||||
print('fail')
|
||||
elif 'warn' in statuses:
|
||||
print('warn')
|
||||
else:
|
||||
print('ok')
|
||||
")
|
||||
|
||||
echo "{\"status\": \"$OVERALL\", \"agent\": \"$AGENT\", \"checks\": $CHECKS}" | python3 -m json.tool
|
||||
|
|
@ -1,217 +0,0 @@
|
|||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
KUBECTL="kubectl --kubeconfig /Users/viktorbarzin/code/infra/config"
|
||||
DRY_RUN=false
|
||||
AGENT="deploy-status"
|
||||
|
||||
for arg in "$@"; do
|
||||
case "$arg" in
|
||||
--dry-run) DRY_RUN=true ;;
|
||||
esac
|
||||
done
|
||||
|
||||
CHECKS="[]"
|
||||
|
||||
add_check() {
|
||||
local name="$1" status="$2" message="$3"
|
||||
CHECKS=$(echo "$CHECKS" | python3 -c "
|
||||
import sys, json
|
||||
checks = json.load(sys.stdin)
|
||||
checks.append({'name': '''$name''', 'status': '''$status''', 'message': '''$message'''})
|
||||
json.dump(checks, sys.stdout)
|
||||
")
|
||||
}
|
||||
|
||||
# Check for stalled rollouts (Progressing=False or deadline exceeded)
|
||||
check_stalled_rollouts() {
|
||||
if $DRY_RUN; then
|
||||
add_check "stalled-rollouts" "ok" "DRY RUN: would check for stalled deployment rollouts"
|
||||
return
|
||||
fi
|
||||
|
||||
local stalled
|
||||
stalled=$($KUBECTL get deployments --all-namespaces -o json 2>/dev/null | python3 -c "
|
||||
import sys, json
|
||||
data = json.load(sys.stdin)
|
||||
stalled = []
|
||||
for dep in data.get('items', []):
|
||||
ns = dep['metadata']['namespace']
|
||||
name = dep['metadata']['name']
|
||||
conditions = dep.get('status', {}).get('conditions', [])
|
||||
for cond in conditions:
|
||||
if cond.get('type') == 'Progressing' and cond.get('status') == 'False':
|
||||
reason = cond.get('reason', 'unknown')
|
||||
stalled.append(f'{ns}/{name}: {reason}')
|
||||
elif cond.get('type') == 'Available' and cond.get('status') == 'False':
|
||||
reason = cond.get('reason', 'unknown')
|
||||
stalled.append(f'{ns}/{name}: unavailable ({reason})')
|
||||
if stalled:
|
||||
print('; '.join(stalled))
|
||||
else:
|
||||
print('')
|
||||
" 2>/dev/null) || stalled="Failed to check deployments"
|
||||
|
||||
if [ -z "$stalled" ]; then
|
||||
add_check "stalled-rollouts" "ok" "No stalled rollouts detected"
|
||||
else
|
||||
add_check "stalled-rollouts" "fail" "Stalled rollouts: $stalled"
|
||||
fi
|
||||
}
|
||||
|
||||
# Check for unavailable replicas
|
||||
check_unavailable_replicas() {
|
||||
if $DRY_RUN; then
|
||||
add_check "unavailable-replicas" "ok" "DRY RUN: would check for deployments with unavailable replicas"
|
||||
return
|
||||
fi
|
||||
|
||||
local unavail
|
||||
unavail=$($KUBECTL get deployments --all-namespaces -o json 2>/dev/null | python3 -c "
|
||||
import sys, json
|
||||
data = json.load(sys.stdin)
|
||||
issues = []
|
||||
for dep in data.get('items', []):
|
||||
ns = dep['metadata']['namespace']
|
||||
name = dep['metadata']['name']
|
||||
spec_replicas = dep.get('spec', {}).get('replicas', 1)
|
||||
ready = dep.get('status', {}).get('readyReplicas', 0) or 0
|
||||
unavailable = dep.get('status', {}).get('unavailableReplicas', 0) or 0
|
||||
if unavailable > 0 or ready < spec_replicas:
|
||||
issues.append(f'{ns}/{name}: {ready}/{spec_replicas} ready, {unavailable} unavailable')
|
||||
if issues:
|
||||
print('; '.join(issues))
|
||||
else:
|
||||
print('')
|
||||
" 2>/dev/null) || unavail="Failed to check replicas"
|
||||
|
||||
if [ -z "$unavail" ]; then
|
||||
add_check "unavailable-replicas" "ok" "All deployments have desired replicas ready"
|
||||
else
|
||||
add_check "unavailable-replicas" "warn" "Unavailable replicas: $unavail"
|
||||
fi
|
||||
}
|
||||
|
||||
# Check for image pull errors
|
||||
check_image_pull_errors() {
|
||||
if $DRY_RUN; then
|
||||
add_check "image-pull-errors" "ok" "DRY RUN: would check for ImagePullBackOff/ErrImagePull pods"
|
||||
return
|
||||
fi
|
||||
|
||||
local pull_errors
|
||||
pull_errors=$($KUBECTL get pods --all-namespaces -o json 2>/dev/null | python3 -c "
|
||||
import sys, json
|
||||
data = json.load(sys.stdin)
|
||||
errors = []
|
||||
for pod in data.get('items', []):
|
||||
ns = pod['metadata']['namespace']
|
||||
name = pod['metadata']['name']
|
||||
for cs in pod.get('status', {}).get('containerStatuses', []) + pod.get('status', {}).get('initContainerStatuses', []):
|
||||
waiting = cs.get('state', {}).get('waiting', {})
|
||||
reason = waiting.get('reason', '')
|
||||
if reason in ('ImagePullBackOff', 'ErrImagePull', 'InvalidImageName'):
|
||||
image = cs.get('image', 'unknown')
|
||||
msg = waiting.get('message', '')[:100]
|
||||
errors.append(f'{ns}/{name}: {reason} image={image} ({msg})')
|
||||
if errors:
|
||||
print('; '.join(errors))
|
||||
else:
|
||||
print('')
|
||||
" 2>/dev/null) || pull_errors="Failed to check image pulls"
|
||||
|
||||
if [ -z "$pull_errors" ]; then
|
||||
add_check "image-pull-errors" "ok" "No image pull errors found"
|
||||
else
|
||||
add_check "image-pull-errors" "fail" "Image pull errors: $pull_errors"
|
||||
fi
|
||||
}
|
||||
|
||||
# Check for recent restarts (>5 in last hour)
|
||||
check_recent_restarts() {
|
||||
if $DRY_RUN; then
|
||||
add_check "recent-restarts" "ok" "DRY RUN: would check for pods with high restart counts"
|
||||
return
|
||||
fi
|
||||
|
||||
local restarts
|
||||
restarts=$($KUBECTL get pods --all-namespaces -o json 2>/dev/null | python3 -c "
|
||||
import sys, json
|
||||
data = json.load(sys.stdin)
|
||||
high_restart = []
|
||||
for pod in data.get('items', []):
|
||||
ns = pod['metadata']['namespace']
|
||||
name = pod['metadata']['name']
|
||||
for cs in pod.get('status', {}).get('containerStatuses', []):
|
||||
count = cs.get('restartCount', 0)
|
||||
if count >= 5:
|
||||
container = cs['name']
|
||||
high_restart.append(f'{ns}/{name}:{container} restarts={count}')
|
||||
if high_restart:
|
||||
print('; '.join(sorted(high_restart, key=lambda x: int(x.split('=')[1]), reverse=True)[:20]))
|
||||
else:
|
||||
print('')
|
||||
" 2>/dev/null) || restarts="Failed to check restarts"
|
||||
|
||||
if [ -z "$restarts" ]; then
|
||||
add_check "recent-restarts" "ok" "No pods with 5+ restarts"
|
||||
else
|
||||
add_check "recent-restarts" "warn" "High restart counts: $restarts"
|
||||
fi
|
||||
}
|
||||
|
||||
# Check CrashLoopBackOff pods
|
||||
check_crashloop() {
|
||||
if $DRY_RUN; then
|
||||
add_check "crashloop" "ok" "DRY RUN: would check for CrashLoopBackOff pods"
|
||||
return
|
||||
fi
|
||||
|
||||
local crashloop
|
||||
crashloop=$($KUBECTL get pods --all-namespaces -o json 2>/dev/null | python3 -c "
|
||||
import sys, json
|
||||
data = json.load(sys.stdin)
|
||||
crashes = []
|
||||
for pod in data.get('items', []):
|
||||
ns = pod['metadata']['namespace']
|
||||
name = pod['metadata']['name']
|
||||
for cs in pod.get('status', {}).get('containerStatuses', []):
|
||||
waiting = cs.get('state', {}).get('waiting', {})
|
||||
if waiting.get('reason') == 'CrashLoopBackOff':
|
||||
container = cs['name']
|
||||
restarts = cs.get('restartCount', 0)
|
||||
crashes.append(f'{ns}/{name}:{container} restarts={restarts}')
|
||||
if crashes:
|
||||
print('; '.join(crashes))
|
||||
else:
|
||||
print('')
|
||||
" 2>/dev/null) || crashloop="Failed to check crashloop"
|
||||
|
||||
if [ -z "$crashloop" ]; then
|
||||
add_check "crashloop" "ok" "No CrashLoopBackOff pods"
|
||||
else
|
||||
add_check "crashloop" "fail" "CrashLoopBackOff: $crashloop"
|
||||
fi
|
||||
}
|
||||
|
||||
# Run all checks
|
||||
check_stalled_rollouts
|
||||
check_unavailable_replicas
|
||||
check_image_pull_errors
|
||||
check_recent_restarts
|
||||
check_crashloop
|
||||
|
||||
# Determine overall status
|
||||
OVERALL=$(echo "$CHECKS" | python3 -c "
|
||||
import sys, json
|
||||
checks = json.load(sys.stdin)
|
||||
statuses = [c['status'] for c in checks]
|
||||
if 'fail' in statuses:
|
||||
print('fail')
|
||||
elif 'warn' in statuses:
|
||||
print('warn')
|
||||
else:
|
||||
print('ok')
|
||||
")
|
||||
|
||||
echo "{\"status\": \"$OVERALL\", \"agent\": \"$AGENT\", \"checks\": $CHECKS}" | python3 -m json.tool
|
||||
|
|
@ -1,144 +0,0 @@
|
|||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
KUBECTL="kubectl --kubeconfig /Users/viktorbarzin/code/infra/config"
|
||||
AGENT="dns-check"
|
||||
DRY_RUN=false
|
||||
|
||||
# Internal DNS server (Technitium)
|
||||
INTERNAL_DNS="10.0.20.100"
|
||||
# Public DNS
|
||||
PUBLIC_DNS="1.1.1.1"
|
||||
|
||||
# Services to check
|
||||
SERVICES=(
|
||||
"grafana.viktorbarzin.me"
|
||||
"prometheus.viktorbarzin.me"
|
||||
"nextcloud.viktorbarzin.me"
|
||||
"authentik.viktorbarzin.me"
|
||||
"viktorbarzin.me"
|
||||
)
|
||||
|
||||
for arg in "$@"; do
|
||||
case "$arg" in
|
||||
--dry-run) DRY_RUN=true ;;
|
||||
esac
|
||||
done
|
||||
|
||||
checks=()
|
||||
|
||||
add_check() {
|
||||
local name="$1" status="$2" message="$3"
|
||||
checks+=("{\"name\": \"$name\", \"status\": \"$status\", \"message\": \"$message\"}")
|
||||
}
|
||||
|
||||
check_dns_resolution() {
|
||||
if $DRY_RUN; then
|
||||
add_check "dns-resolution" "ok" "dry-run: would resolve ${#SERVICES[@]} services via internal and public DNS"
|
||||
return
|
||||
fi
|
||||
|
||||
local failures=0 mismatches=0 successes=0
|
||||
local failure_details="" mismatch_details=""
|
||||
|
||||
for svc in "${SERVICES[@]}"; do
|
||||
local internal_result public_result
|
||||
|
||||
internal_result=$(dig +short "$svc" @"$INTERNAL_DNS" A 2>/dev/null | head -1) || internal_result=""
|
||||
public_result=$(dig +short "$svc" @"$PUBLIC_DNS" A 2>/dev/null | head -1) || public_result=""
|
||||
|
||||
if [ -z "$internal_result" ] && [ -z "$public_result" ]; then
|
||||
failures=$((failures + 1))
|
||||
failure_details="${failure_details}${svc} (both resolvers failed); "
|
||||
elif [ -z "$internal_result" ]; then
|
||||
failures=$((failures + 1))
|
||||
failure_details="${failure_details}${svc} (internal DNS failed); "
|
||||
elif [ -z "$public_result" ]; then
|
||||
# Public might use CNAME/proxy, not necessarily a failure
|
||||
successes=$((successes + 1))
|
||||
elif [ "$internal_result" != "$public_result" ]; then
|
||||
# Mismatch is informational — Cloudflare proxy IPs differ from internal IPs
|
||||
mismatches=$((mismatches + 1))
|
||||
mismatch_details="${mismatch_details}${svc} (internal=${internal_result} public=${public_result}); "
|
||||
successes=$((successes + 1))
|
||||
else
|
||||
successes=$((successes + 1))
|
||||
fi
|
||||
done
|
||||
|
||||
if [ "$failures" -gt 0 ]; then
|
||||
add_check "dns-resolution" "fail" "${failures} DNS failures: ${failure_details}"
|
||||
elif [ "$mismatches" -gt 0 ]; then
|
||||
add_check "dns-resolution" "ok" "${successes}/${#SERVICES[@]} resolved. ${mismatches} internal/public mismatches (expected with Cloudflare proxy): ${mismatch_details}"
|
||||
else
|
||||
add_check "dns-resolution" "ok" "All ${successes}/${#SERVICES[@]} services resolved successfully"
|
||||
fi
|
||||
}
|
||||
|
||||
check_technitium_health() {
|
||||
if $DRY_RUN; then
|
||||
add_check "technitium" "ok" "dry-run: would check Technitium DNS server pod health"
|
||||
return
|
||||
fi
|
||||
|
||||
local tech_pods
|
||||
tech_pods=$($KUBECTL get pods -A -l app.kubernetes.io/name=technitium --no-headers 2>/dev/null) || \
|
||||
tech_pods=$($KUBECTL get pods -A --no-headers 2>/dev/null | grep -i technitium || true)
|
||||
|
||||
if [ -z "$tech_pods" ]; then
|
||||
add_check "technitium" "warn" "No Technitium pods found"
|
||||
return
|
||||
fi
|
||||
|
||||
local not_running
|
||||
not_running=$(echo "$tech_pods" | grep -v "Running" | grep -c "." 2>/dev/null || echo "0")
|
||||
|
||||
if [ "$not_running" -gt 0 ]; then
|
||||
add_check "technitium" "fail" "Technitium pod(s) not running"
|
||||
else
|
||||
add_check "technitium" "ok" "Technitium DNS server pod(s) running"
|
||||
fi
|
||||
}
|
||||
|
||||
check_coredns_health() {
|
||||
if $DRY_RUN; then
|
||||
add_check "coredns" "ok" "dry-run: would check CoreDNS pod health"
|
||||
return
|
||||
fi
|
||||
|
||||
local coredns_pods
|
||||
coredns_pods=$($KUBECTL get pods -n kube-system -l k8s-app=kube-dns --no-headers 2>/dev/null) || {
|
||||
add_check "coredns" "warn" "Failed to query CoreDNS pods"
|
||||
return
|
||||
}
|
||||
|
||||
if [ -z "$coredns_pods" ]; then
|
||||
add_check "coredns" "warn" "No CoreDNS pods found"
|
||||
return
|
||||
fi
|
||||
|
||||
local total not_running
|
||||
total=$(echo "$coredns_pods" | grep -c "." 2>/dev/null || echo "0")
|
||||
not_running=$(echo "$coredns_pods" | grep -v "Running" | grep -c "." 2>/dev/null || echo "0")
|
||||
|
||||
if [ "$not_running" -gt 0 ]; then
|
||||
add_check "coredns" "fail" "${not_running}/${total} CoreDNS pod(s) not running"
|
||||
else
|
||||
add_check "coredns" "ok" "All ${total} CoreDNS pod(s) running"
|
||||
fi
|
||||
}
|
||||
|
||||
check_dns_resolution
|
||||
check_technitium_health
|
||||
check_coredns_health
|
||||
|
||||
# Output JSON
|
||||
overall="ok"
|
||||
for c in "${checks[@]}"; do
|
||||
s=$(echo "$c" | jq -r '.status')
|
||||
if [ "$s" = "fail" ]; then overall="fail"; break; fi
|
||||
if [ "$s" = "warn" ]; then overall="warn"; fi
|
||||
done
|
||||
|
||||
printf '{"status": "%s", "agent": "%s", "checks": [%s]}\n' \
|
||||
"$overall" "$AGENT" "$(IFS=,; echo "${checks[*]}")"
|
||||
|
|
@ -1,281 +0,0 @@
|
|||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
AGENT="monitoring-health"
|
||||
KUBECTL="kubectl --kubeconfig /Users/viktorbarzin/code/infra/config"
|
||||
MONITORING_NS="monitoring"
|
||||
DRY_RUN=false
|
||||
|
||||
for arg in "$@"; do
|
||||
case "$arg" in
|
||||
--dry-run) DRY_RUN=true ;;
|
||||
esac
|
||||
done
|
||||
|
||||
checks=()
|
||||
|
||||
add_check() {
|
||||
local name="$1" status="$2" message="$3"
|
||||
checks+=("{\"name\": \"$name\", \"status\": \"$status\", \"message\": \"$message\"}")
|
||||
}
|
||||
|
||||
check_prometheus() {
|
||||
if $DRY_RUN; then
|
||||
add_check "prometheus" "ok" "dry-run: would check Prometheus server health"
|
||||
return
|
||||
fi
|
||||
|
||||
# Discover Prometheus server pod via labels
|
||||
local prom_pod
|
||||
prom_pod=$($KUBECTL get pods -n "$MONITORING_NS" -l app.kubernetes.io/name=prometheus,app.kubernetes.io/component=server -o name 2>/dev/null | head -1)
|
||||
if [ -z "$prom_pod" ]; then
|
||||
prom_pod=$($KUBECTL get pods -n "$MONITORING_NS" -l app=prometheus,component=server -o name 2>/dev/null | head -1)
|
||||
fi
|
||||
if [ -z "$prom_pod" ]; then
|
||||
prom_pod=$($KUBECTL get pods -n "$MONITORING_NS" -o name 2>/dev/null | grep prometheus-server | head -1)
|
||||
fi
|
||||
|
||||
if [ -z "$prom_pod" ]; then
|
||||
add_check "prometheus" "fail" "No Prometheus server pod found in $MONITORING_NS"
|
||||
return
|
||||
fi
|
||||
|
||||
local phase
|
||||
phase=$($KUBECTL get "$prom_pod" -n "$MONITORING_NS" -o jsonpath='{.status.phase}' 2>/dev/null)
|
||||
if [ "$phase" != "Running" ]; then
|
||||
add_check "prometheus" "fail" "Prometheus server pod phase: $phase"
|
||||
return
|
||||
fi
|
||||
|
||||
# Check Prometheus is responding
|
||||
local prom_healthy
|
||||
prom_healthy=$($KUBECTL exec "$prom_pod" -n "$MONITORING_NS" -c prometheus-server -- \
|
||||
wget -q -O- "http://localhost:9090/-/healthy" 2>/dev/null || echo "unhealthy")
|
||||
|
||||
if echo "$prom_healthy" | grep -qi "ok\|healthy"; then
|
||||
# Check target scraping
|
||||
local targets_up
|
||||
targets_up=$($KUBECTL exec "$prom_pod" -n "$MONITORING_NS" -c prometheus-server -- \
|
||||
wget -q -O- "http://localhost:9090/api/v1/targets" 2>/dev/null | \
|
||||
python3 -c "
|
||||
import sys, json
|
||||
try:
|
||||
data = json.load(sys.stdin)
|
||||
active = data.get('data',{}).get('activeTargets',[])
|
||||
up = sum(1 for t in active if t.get('health') == 'up')
|
||||
total = len(active)
|
||||
print(f'{up}/{total}')
|
||||
except: print('unknown')
|
||||
" 2>/dev/null || echo "unknown")
|
||||
add_check "prometheus" "ok" "Prometheus server healthy, targets: $targets_up up"
|
||||
else
|
||||
add_check "prometheus" "warn" "Prometheus server running but health check unclear"
|
||||
fi
|
||||
}
|
||||
|
||||
check_alertmanager() {
|
||||
if $DRY_RUN; then
|
||||
add_check "alertmanager" "ok" "dry-run: would check Alertmanager health"
|
||||
return
|
||||
fi
|
||||
|
||||
# Discover Alertmanager pod
|
||||
local am_pod
|
||||
am_pod=$($KUBECTL get pods -n "$MONITORING_NS" -l app.kubernetes.io/name=alertmanager -o name 2>/dev/null | head -1)
|
||||
if [ -z "$am_pod" ]; then
|
||||
am_pod=$($KUBECTL get pods -n "$MONITORING_NS" -o name 2>/dev/null | grep alertmanager | head -1)
|
||||
fi
|
||||
|
||||
if [ -z "$am_pod" ]; then
|
||||
add_check "alertmanager" "fail" "No Alertmanager pod found in $MONITORING_NS"
|
||||
return
|
||||
fi
|
||||
|
||||
local phase
|
||||
phase=$($KUBECTL get "$am_pod" -n "$MONITORING_NS" -o jsonpath='{.status.phase}' 2>/dev/null)
|
||||
if [ "$phase" != "Running" ]; then
|
||||
add_check "alertmanager" "fail" "Alertmanager pod phase: $phase"
|
||||
return
|
||||
fi
|
||||
|
||||
# Check firing alerts
|
||||
local alert_info
|
||||
alert_info=$($KUBECTL exec "$am_pod" -n "$MONITORING_NS" -- \
|
||||
wget -q -O- "http://localhost:9093/api/v2/alerts?active=true" 2>/dev/null | \
|
||||
python3 -c "
|
||||
import sys, json
|
||||
try:
|
||||
alerts = json.load(sys.stdin)
|
||||
firing = [a for a in alerts if a.get('status',{}).get('state') == 'active']
|
||||
print(len(firing))
|
||||
except: print('unknown')
|
||||
" 2>/dev/null || echo "unknown")
|
||||
|
||||
# Check silences
|
||||
local silence_count
|
||||
silence_count=$($KUBECTL exec "$am_pod" -n "$MONITORING_NS" -- \
|
||||
wget -q -O- "http://localhost:9093/api/v2/silences" 2>/dev/null | \
|
||||
python3 -c "
|
||||
import sys, json
|
||||
try:
|
||||
silences = json.load(sys.stdin)
|
||||
active = [s for s in silences if s.get('status',{}).get('state') == 'active']
|
||||
print(len(active))
|
||||
except: print('0')
|
||||
" 2>/dev/null || echo "0")
|
||||
|
||||
if [ "$alert_info" = "unknown" ]; then
|
||||
add_check "alertmanager" "warn" "Alertmanager running but could not query alerts"
|
||||
else
|
||||
local status="ok"
|
||||
[ "$alert_info" -gt 0 ] 2>/dev/null && status="warn"
|
||||
add_check "alertmanager" "$status" "Alertmanager healthy: $alert_info firing alerts, $silence_count active silences"
|
||||
fi
|
||||
}
|
||||
|
||||
check_grafana() {
|
||||
if $DRY_RUN; then
|
||||
add_check "grafana" "ok" "dry-run: would check Grafana health"
|
||||
return
|
||||
fi
|
||||
|
||||
# Discover Grafana pod
|
||||
local grafana_pod
|
||||
grafana_pod=$($KUBECTL get pods -n "$MONITORING_NS" -l app.kubernetes.io/name=grafana -o name 2>/dev/null | head -1)
|
||||
if [ -z "$grafana_pod" ]; then
|
||||
grafana_pod=$($KUBECTL get pods -n "$MONITORING_NS" -o name 2>/dev/null | grep grafana | grep -v test | head -1)
|
||||
fi
|
||||
|
||||
if [ -z "$grafana_pod" ]; then
|
||||
add_check "grafana" "fail" "No Grafana pod found in $MONITORING_NS"
|
||||
return
|
||||
fi
|
||||
|
||||
local phase
|
||||
phase=$($KUBECTL get "$grafana_pod" -n "$MONITORING_NS" -o jsonpath='{.status.phase}' 2>/dev/null)
|
||||
if [ "$phase" != "Running" ]; then
|
||||
add_check "grafana" "fail" "Grafana pod phase: $phase"
|
||||
return
|
||||
fi
|
||||
|
||||
# Check datasource connectivity
|
||||
local ds_info
|
||||
ds_info=$($KUBECTL exec "$grafana_pod" -n "$MONITORING_NS" -- \
|
||||
curl -sf "http://localhost:3000/api/datasources" 2>/dev/null | \
|
||||
python3 -c "
|
||||
import sys, json
|
||||
try:
|
||||
ds = json.load(sys.stdin)
|
||||
names = [d.get('name','?') for d in ds]
|
||||
print(f'{len(ds)} datasources: {\", \".join(names)}')
|
||||
except: print('unknown')
|
||||
" 2>/dev/null || echo "unknown")
|
||||
|
||||
if [ "$ds_info" = "unknown" ]; then
|
||||
add_check "grafana" "warn" "Grafana running but could not query datasources (may need auth)"
|
||||
else
|
||||
add_check "grafana" "ok" "Grafana healthy, $ds_info"
|
||||
fi
|
||||
}
|
||||
|
||||
check_snmp_exporters() {
|
||||
if $DRY_RUN; then
|
||||
add_check "snmp-exporters" "ok" "dry-run: would check SNMP exporter pods"
|
||||
return
|
||||
fi
|
||||
|
||||
local exporters=("snmp-exporter" "idrac-redfish-exporter" "proxmox-exporter")
|
||||
local running=0 total=0
|
||||
|
||||
for exporter in "${exporters[@]}"; do
|
||||
total=$((total + 1))
|
||||
local pod
|
||||
pod=$($KUBECTL get pods -n "$MONITORING_NS" -o name 2>/dev/null | grep "$exporter" | head -1)
|
||||
|
||||
if [ -z "$pod" ]; then
|
||||
# Try all namespaces
|
||||
pod=$($KUBECTL get pods --all-namespaces -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name --no-headers 2>/dev/null | \
|
||||
grep "$exporter" | head -1)
|
||||
if [ -z "$pod" ]; then
|
||||
add_check "exporter-$exporter" "warn" "$exporter pod not found"
|
||||
continue
|
||||
fi
|
||||
local ns
|
||||
ns=$(echo "$pod" | awk '{print $1}')
|
||||
local name
|
||||
name=$(echo "$pod" | awk '{print $2}')
|
||||
local phase
|
||||
phase=$($KUBECTL get pod "$name" -n "$ns" -o jsonpath='{.status.phase}' 2>/dev/null)
|
||||
if [ "$phase" = "Running" ]; then
|
||||
running=$((running + 1))
|
||||
add_check "exporter-$exporter" "ok" "$exporter running in $ns"
|
||||
else
|
||||
add_check "exporter-$exporter" "warn" "$exporter phase: $phase in $ns"
|
||||
fi
|
||||
else
|
||||
local phase
|
||||
phase=$($KUBECTL get "$pod" -n "$MONITORING_NS" -o jsonpath='{.status.phase}' 2>/dev/null)
|
||||
if [ "$phase" = "Running" ]; then
|
||||
running=$((running + 1))
|
||||
add_check "exporter-$exporter" "ok" "$exporter running"
|
||||
else
|
||||
add_check "exporter-$exporter" "warn" "$exporter phase: $phase"
|
||||
fi
|
||||
fi
|
||||
done
|
||||
}
|
||||
|
||||
check_prometheus_storage() {
|
||||
if $DRY_RUN; then
|
||||
add_check "prometheus-storage" "ok" "dry-run: would check Prometheus storage usage"
|
||||
return
|
||||
fi
|
||||
|
||||
local prom_pvc
|
||||
prom_pvc=$($KUBECTL get pvc -n "$MONITORING_NS" -o name 2>/dev/null | grep prometheus-server | head -1)
|
||||
|
||||
if [ -z "$prom_pvc" ]; then
|
||||
add_check "prometheus-storage" "warn" "No Prometheus server PVC found"
|
||||
return
|
||||
fi
|
||||
|
||||
# Check storage via Prometheus TSDB stats
|
||||
local prom_pod
|
||||
prom_pod=$($KUBECTL get pods -n "$MONITORING_NS" -l app.kubernetes.io/name=prometheus,app.kubernetes.io/component=server -o name 2>/dev/null | head -1)
|
||||
if [ -z "$prom_pod" ]; then
|
||||
prom_pod=$($KUBECTL get pods -n "$MONITORING_NS" -o name 2>/dev/null | grep prometheus-server | head -1)
|
||||
fi
|
||||
|
||||
if [ -n "$prom_pod" ]; then
|
||||
local storage_info
|
||||
storage_info=$($KUBECTL exec "$prom_pod" -n "$MONITORING_NS" -c prometheus-server -- \
|
||||
df -h /data 2>/dev/null | tail -1 | awk '{printf "%s used of %s (%s)", $3, $2, $5}' || echo "unknown")
|
||||
add_check "prometheus-storage" "ok" "Prometheus storage: $storage_info"
|
||||
else
|
||||
add_check "prometheus-storage" "warn" "Could not check Prometheus storage"
|
||||
fi
|
||||
}
|
||||
|
||||
# Run checks
|
||||
check_prometheus
|
||||
check_alertmanager
|
||||
check_grafana
|
||||
check_snmp_exporters
|
||||
check_prometheus_storage
|
||||
|
||||
# Determine overall status
|
||||
overall="ok"
|
||||
for c in "${checks[@]}"; do
|
||||
if echo "$c" | grep -q '"status": "fail"'; then
|
||||
overall="fail"
|
||||
break
|
||||
elif echo "$c" | grep -q '"status": "warn"'; then
|
||||
overall="warn"
|
||||
fi
|
||||
done
|
||||
|
||||
# Output JSON
|
||||
checks_json=$(IFS=,; echo "${checks[*]}")
|
||||
cat <<EOF
|
||||
{"status": "$overall", "agent": "$AGENT", "checks": [$checks_json]}
|
||||
EOF
|
||||
|
|
@ -1,166 +0,0 @@
|
|||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
KUBECTL="kubectl --kubeconfig /Users/viktorbarzin/code/infra/config"
|
||||
PFSENSE="python3 /Users/viktorbarzin/code/infra/.claude/pfsense.py"
|
||||
AGENT="network-health"
|
||||
DRY_RUN=false
|
||||
|
||||
for arg in "$@"; do
|
||||
case "$arg" in
|
||||
--dry-run) DRY_RUN=true ;;
|
||||
esac
|
||||
done
|
||||
|
||||
checks=()
|
||||
|
||||
add_check() {
|
||||
local name="$1" status="$2" message="$3"
|
||||
checks+=("{\"name\": \"$name\", \"status\": \"$status\", \"message\": \"$message\"}")
|
||||
}
|
||||
|
||||
check_pfsense_status() {
|
||||
if $DRY_RUN; then
|
||||
add_check "pfsense" "ok" "dry-run: would check pfSense system status via pfsense.py"
|
||||
return
|
||||
fi
|
||||
|
||||
local pf_output
|
||||
pf_output=$($PFSENSE status 2>/dev/null) || {
|
||||
add_check "pfsense" "fail" "Failed to connect to pfSense via pfsense.py"
|
||||
return
|
||||
}
|
||||
|
||||
if echo "$pf_output" | grep -qi "error\|fail\|down"; then
|
||||
add_check "pfsense" "warn" "pfSense reported issues: $(echo "$pf_output" | head -3 | tr '\n' ' ')"
|
||||
else
|
||||
add_check "pfsense" "ok" "pfSense system healthy"
|
||||
fi
|
||||
}
|
||||
|
||||
check_vpn_status() {
|
||||
if $DRY_RUN; then
|
||||
add_check "vpn" "ok" "dry-run: would check VPN tunnel status via pfsense.py"
|
||||
return
|
||||
fi
|
||||
|
||||
local vpn_output
|
||||
vpn_output=$($PFSENSE wireguard 2>/dev/null) || {
|
||||
add_check "vpn" "warn" "Failed to query VPN status via pfsense.py"
|
||||
return
|
||||
}
|
||||
|
||||
if echo "$vpn_output" | grep -qi "error\|fail\|down"; then
|
||||
add_check "vpn" "warn" "VPN issues detected: $(echo "$vpn_output" | head -3 | tr '\n' ' ')"
|
||||
else
|
||||
add_check "vpn" "ok" "VPN tunnels healthy"
|
||||
fi
|
||||
}
|
||||
|
||||
check_metallb_speakers() {
|
||||
if $DRY_RUN; then
|
||||
add_check "metallb-speakers" "ok" "dry-run: would check MetalLB speaker pod health"
|
||||
return
|
||||
fi
|
||||
|
||||
local ns="metallb-system"
|
||||
|
||||
# Find MetalLB speaker pods via labels first
|
||||
local speaker_pods
|
||||
speaker_pods=$($KUBECTL get pods -n "$ns" -l app.kubernetes.io/component=speaker --no-headers 2>/dev/null) || \
|
||||
speaker_pods=$($KUBECTL get pods -n "$ns" -l component=speaker --no-headers 2>/dev/null) || \
|
||||
speaker_pods=$($KUBECTL get pods -n "$ns" --no-headers 2>/dev/null | grep -i speaker || true)
|
||||
|
||||
if [ -z "$speaker_pods" ]; then
|
||||
add_check "metallb-speakers" "warn" "No MetalLB speaker pods found in ${ns}"
|
||||
return
|
||||
fi
|
||||
|
||||
local total not_running
|
||||
total=$(echo "$speaker_pods" | grep -c "." 2>/dev/null || echo "0")
|
||||
not_running=$(echo "$speaker_pods" | grep -v "Running" | grep -c "." 2>/dev/null || echo "0")
|
||||
|
||||
if [ "$not_running" -gt 0 ]; then
|
||||
add_check "metallb-speakers" "fail" "${not_running}/${total} MetalLB speaker pod(s) not running"
|
||||
else
|
||||
add_check "metallb-speakers" "ok" "All ${total} MetalLB speaker pod(s) running"
|
||||
fi
|
||||
}
|
||||
|
||||
check_metallb_l2() {
|
||||
if $DRY_RUN; then
|
||||
add_check "metallb-l2" "ok" "dry-run: would check MetalLB L2 advertisements"
|
||||
return
|
||||
fi
|
||||
|
||||
local ns="metallb-system"
|
||||
|
||||
# Check L2Advertisement CRDs
|
||||
local l2_ads
|
||||
l2_ads=$($KUBECTL get l2advertisements -n "$ns" -o json 2>/dev/null) || {
|
||||
add_check "metallb-l2" "warn" "Could not query L2Advertisement CRDs"
|
||||
return
|
||||
}
|
||||
|
||||
local count
|
||||
count=$(echo "$l2_ads" | jq '.items | length' 2>/dev/null || echo "0")
|
||||
|
||||
if [ "$count" -eq 0 ]; then
|
||||
add_check "metallb-l2" "warn" "No L2Advertisement resources found"
|
||||
else
|
||||
# Check MetalLB controller
|
||||
local controller
|
||||
controller=$($KUBECTL get pods -n "$ns" -l app.kubernetes.io/component=controller --no-headers 2>/dev/null) || \
|
||||
controller=$($KUBECTL get pods -n "$ns" --no-headers 2>/dev/null | grep -i controller || true)
|
||||
|
||||
if [ -z "$controller" ]; then
|
||||
add_check "metallb-l2" "warn" "${count} L2Advertisement(s) found but no controller pod"
|
||||
elif echo "$controller" | grep -q "Running"; then
|
||||
add_check "metallb-l2" "ok" "${count} L2Advertisement(s) configured, controller running"
|
||||
else
|
||||
add_check "metallb-l2" "warn" "${count} L2Advertisement(s) found but controller not running"
|
||||
fi
|
||||
fi
|
||||
}
|
||||
|
||||
check_node_connectivity() {
|
||||
if $DRY_RUN; then
|
||||
add_check "node-connectivity" "ok" "dry-run: would ping k8s nodes"
|
||||
return
|
||||
fi
|
||||
|
||||
local nodes=("10.0.20.100" "10.0.20.101" "10.0.20.102" "10.0.20.103" "10.0.20.104")
|
||||
local names=("k8s-master" "k8s-node1" "k8s-node2" "k8s-node3" "k8s-node4")
|
||||
local failures=0
|
||||
local failure_details=""
|
||||
|
||||
for i in "${!nodes[@]}"; do
|
||||
if ! ping -c 1 -W 2 "${nodes[$i]}" >/dev/null 2>&1; then
|
||||
failures=$((failures + 1))
|
||||
failure_details="${failure_details}${names[$i]}(${nodes[$i]}) "
|
||||
fi
|
||||
done
|
||||
|
||||
if [ "$failures" -gt 0 ]; then
|
||||
add_check "node-connectivity" "fail" "${failures} node(s) unreachable: ${failure_details}"
|
||||
else
|
||||
add_check "node-connectivity" "ok" "All ${#nodes[@]} nodes reachable"
|
||||
fi
|
||||
}
|
||||
|
||||
check_pfsense_status
|
||||
check_vpn_status
|
||||
check_metallb_speakers
|
||||
check_metallb_l2
|
||||
check_node_connectivity
|
||||
|
||||
# Output JSON
|
||||
overall="ok"
|
||||
for c in "${checks[@]}"; do
|
||||
s=$(echo "$c" | jq -r '.status')
|
||||
if [ "$s" = "fail" ]; then overall="fail"; break; fi
|
||||
if [ "$s" = "warn" ]; then overall="warn"; fi
|
||||
done
|
||||
|
||||
printf '{"status": "%s", "agent": "%s", "checks": [%s]}\n' \
|
||||
"$overall" "$AGENT" "$(IFS=,; echo "${checks[*]}")"
|
||||
|
|
@ -1,174 +0,0 @@
|
|||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
AGENT="nfs-health"
|
||||
KUBECTL="kubectl --kubeconfig /Users/viktorbarzin/code/infra/config"
|
||||
NFS_HOST="192.168.1.127"
|
||||
NODES=("k8s-master:10.0.20.100" "k8s-node1:10.0.20.101" "k8s-node2:10.0.20.102" "k8s-node3:10.0.20.103" "k8s-node4:10.0.20.104")
|
||||
SSH_USER="wizard"
|
||||
DRY_RUN=false
|
||||
|
||||
for arg in "$@"; do
|
||||
case "$arg" in
|
||||
--dry-run) DRY_RUN=true ;;
|
||||
esac
|
||||
done
|
||||
|
||||
checks=()
|
||||
|
||||
add_check() {
|
||||
local name="$1" status="$2" message="$3"
|
||||
checks+=("{\"name\": \"$name\", \"status\": \"$status\", \"message\": \"$message\"}")
|
||||
}
|
||||
|
||||
check_nfs_reachable() {
|
||||
if $DRY_RUN; then
|
||||
add_check "nfs-reachable" "ok" "dry-run: would ping $NFS_HOST"
|
||||
return
|
||||
fi
|
||||
if timeout 5 ping -c 1 "$NFS_HOST" &>/dev/null; then
|
||||
add_check "nfs-reachable" "ok" "Proxmox NFS at $NFS_HOST is reachable"
|
||||
else
|
||||
add_check "nfs-reachable" "fail" "Proxmox NFS at $NFS_HOST is unreachable"
|
||||
fi
|
||||
}
|
||||
|
||||
check_nfs_exports() {
|
||||
if $DRY_RUN; then
|
||||
add_check "nfs-exports" "ok" "dry-run: would check NFS exports on Proxmox"
|
||||
return
|
||||
fi
|
||||
local result
|
||||
if result=$(timeout 10 ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no "root@$NFS_HOST" \
|
||||
"exportfs -v 2>/dev/null || cat /etc/exports 2>/dev/null" 2>/dev/null); then
|
||||
local export_count
|
||||
export_count=$(echo "$result" | grep -c '/' || echo 0)
|
||||
if [ "$export_count" -gt 0 ]; then
|
||||
add_check "nfs-exports" "ok" "$export_count NFS exports active on Proxmox"
|
||||
else
|
||||
add_check "nfs-exports" "warn" "No NFS exports found on Proxmox"
|
||||
fi
|
||||
else
|
||||
add_check "nfs-exports" "fail" "Could not check NFS exports on Proxmox via SSH"
|
||||
fi
|
||||
}
|
||||
|
||||
check_nfs_disk_usage() {
|
||||
if $DRY_RUN; then
|
||||
add_check "nfs-disk" "ok" "dry-run: would check NFS disk usage"
|
||||
return
|
||||
fi
|
||||
local result
|
||||
if result=$(timeout 10 ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no "root@$NFS_HOST" \
|
||||
"df -h /srv/nfs /srv/nfs-ssd 2>/dev/null" 2>/dev/null); then
|
||||
while IFS= read -r line; do
|
||||
local mount pct
|
||||
mount=$(echo "$line" | awk '{print $6}')
|
||||
pct=$(echo "$line" | awk '{print $5}' | tr -d '%')
|
||||
[ -z "$pct" ] || ! [[ "$pct" =~ ^[0-9]+$ ]] && continue
|
||||
if [ "$pct" -ge 90 ]; then
|
||||
add_check "nfs-disk-$mount" "fail" "$mount is ${pct}% full"
|
||||
elif [ "$pct" -ge 80 ]; then
|
||||
add_check "nfs-disk-$mount" "warn" "$mount is ${pct}% full"
|
||||
else
|
||||
add_check "nfs-disk-$mount" "ok" "$mount is ${pct}% full"
|
||||
fi
|
||||
done <<< "$result"
|
||||
else
|
||||
add_check "nfs-disk" "warn" "Could not check NFS disk usage"
|
||||
fi
|
||||
}
|
||||
|
||||
check_node_nfs_mounts() {
|
||||
local node_name="$1" node_ip="$2"
|
||||
|
||||
if $DRY_RUN; then
|
||||
add_check "nfs-mounts-$node_name" "ok" "dry-run: would check NFS mounts on $node_name ($node_ip)"
|
||||
return
|
||||
fi
|
||||
|
||||
local mount_output
|
||||
if ! mount_output=$(timeout 15 ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no "$SSH_USER@$node_ip" \
|
||||
"mount | grep nfs" 2>/dev/null); then
|
||||
add_check "nfs-mounts-$node_name" "warn" "No NFS mounts found or SSH failed on $node_name ($node_ip)"
|
||||
return
|
||||
fi
|
||||
|
||||
if [ -z "$mount_output" ]; then
|
||||
add_check "nfs-mounts-$node_name" "warn" "No NFS mounts found on $node_name"
|
||||
return
|
||||
fi
|
||||
|
||||
local mount_count
|
||||
mount_count=$(echo "$mount_output" | wc -l | tr -d ' ')
|
||||
|
||||
# Check for stale mounts by trying to stat each mount point
|
||||
local stale_count=0
|
||||
local stale_mounts=""
|
||||
while IFS= read -r line; do
|
||||
local mount_point
|
||||
mount_point=$(echo "$line" | awk '{print $3}')
|
||||
if [ -n "$mount_point" ]; then
|
||||
if ! timeout 10 ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no "$SSH_USER@$node_ip" \
|
||||
"timeout 5 stat '$mount_point' >/dev/null 2>&1" 2>/dev/null; then
|
||||
stale_count=$((stale_count + 1))
|
||||
stale_mounts="$stale_mounts $mount_point"
|
||||
fi
|
||||
fi
|
||||
done <<< "$mount_output"
|
||||
|
||||
if [ "$stale_count" -gt 0 ]; then
|
||||
add_check "nfs-mounts-$node_name" "fail" "$stale_count/$mount_count NFS mounts stale on $node_name:$stale_mounts"
|
||||
else
|
||||
add_check "nfs-mounts-$node_name" "ok" "$mount_count NFS mounts healthy on $node_name"
|
||||
fi
|
||||
}
|
||||
|
||||
check_nfs_pvcs() {
|
||||
if $DRY_RUN; then
|
||||
add_check "nfs-pvcs" "ok" "dry-run: would check NFS-backed PVCs"
|
||||
return
|
||||
fi
|
||||
|
||||
local pending
|
||||
pending=$($KUBECTL get pvc --all-namespaces --field-selector='status.phase!=Bound' -o json 2>/dev/null | \
|
||||
python3 -c "import sys,json; items=json.load(sys.stdin).get('items',[]); nfs=[i for i in items if 'nfs' in json.dumps(i).lower()]; print(len(nfs))" 2>/dev/null || echo "error")
|
||||
|
||||
if [ "$pending" = "error" ]; then
|
||||
add_check "nfs-pvcs" "warn" "Could not check NFS PVC status"
|
||||
elif [ "$pending" = "0" ]; then
|
||||
add_check "nfs-pvcs" "ok" "All NFS-backed PVCs are bound"
|
||||
else
|
||||
add_check "nfs-pvcs" "fail" "$pending NFS-backed PVCs are not bound"
|
||||
fi
|
||||
}
|
||||
|
||||
# Run checks
|
||||
check_nfs_reachable
|
||||
check_nfs_exports
|
||||
check_nfs_disk_usage
|
||||
|
||||
for node_entry in "${NODES[@]}"; do
|
||||
node_name="${node_entry%%:*}"
|
||||
node_ip="${node_entry##*:}"
|
||||
check_node_nfs_mounts "$node_name" "$node_ip"
|
||||
done
|
||||
|
||||
check_nfs_pvcs
|
||||
|
||||
# Determine overall status
|
||||
overall="ok"
|
||||
for c in "${checks[@]}"; do
|
||||
if echo "$c" | grep -q '"status": "fail"'; then
|
||||
overall="fail"
|
||||
break
|
||||
elif echo "$c" | grep -q '"status": "warn"'; then
|
||||
overall="warn"
|
||||
fi
|
||||
done
|
||||
|
||||
# Output JSON
|
||||
checks_json=$(IFS=,; echo "${checks[*]}")
|
||||
cat <<EOF
|
||||
{"status": "$overall", "agent": "$AGENT", "checks": [$checks_json]}
|
||||
EOF
|
||||
|
|
@ -1,214 +0,0 @@
|
|||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
KUBECTL="kubectl --kubeconfig /Users/viktorbarzin/code/infra/config"
|
||||
DRY_RUN=false
|
||||
AGENT="oom-investigator"
|
||||
|
||||
for arg in "$@"; do
|
||||
case "$arg" in
|
||||
--dry-run) DRY_RUN=true ;;
|
||||
esac
|
||||
done
|
||||
|
||||
CHECKS="[]"
|
||||
|
||||
add_check() {
|
||||
local name="$1" status="$2" message="$3"
|
||||
CHECKS=$(echo "$CHECKS" | python3 -c "
|
||||
import sys, json
|
||||
checks = json.load(sys.stdin)
|
||||
checks.append({'name': '''$name''', 'status': '''$status''', 'message': '''$message'''})
|
||||
json.dump(checks, sys.stdout)
|
||||
")
|
||||
}
|
||||
|
||||
# Find OOMKilled pods across all namespaces
|
||||
find_oomkilled() {
|
||||
if $DRY_RUN; then
|
||||
add_check "oom-killed-pods" "ok" "DRY RUN: would check for OOMKilled pods across all namespaces"
|
||||
return
|
||||
fi
|
||||
|
||||
local oom_pods
|
||||
oom_pods=$($KUBECTL get pods --all-namespaces -o json | python3 -c "
|
||||
import sys, json
|
||||
data = json.load(sys.stdin)
|
||||
results = []
|
||||
for pod in data.get('items', []):
|
||||
ns = pod['metadata']['namespace']
|
||||
name = pod['metadata']['name']
|
||||
for cs in pod.get('status', {}).get('containerStatuses', []) + pod.get('status', {}).get('initContainerStatuses', []):
|
||||
last = cs.get('lastState', {}).get('terminated', {})
|
||||
current = cs.get('state', {}).get('terminated', {})
|
||||
for state in [last, current]:
|
||||
if state.get('reason') == 'OOMKilled':
|
||||
container = cs['name']
|
||||
restart_count = cs.get('restartCount', 0)
|
||||
finished = state.get('finishedAt', 'unknown')
|
||||
results.append({'namespace': ns, 'pod': name, 'container': container, 'restarts': restart_count, 'finishedAt': finished})
|
||||
json.dump(results, sys.stdout)
|
||||
" 2>/dev/null) || oom_pods="[]"
|
||||
|
||||
local count
|
||||
count=$(echo "$oom_pods" | python3 -c "import sys,json; print(len(json.load(sys.stdin)))")
|
||||
|
||||
if [ "$count" -eq 0 ]; then
|
||||
add_check "oom-killed-pods" "ok" "No OOMKilled pods found"
|
||||
else
|
||||
add_check "oom-killed-pods" "fail" "Found $count OOMKilled container(s): $(echo "$oom_pods" | python3 -c "
|
||||
import sys,json
|
||||
pods = json.load(sys.stdin)
|
||||
print('; '.join(f\"{p['namespace']}/{p['pod']}:{p['container']} (restarts={p['restarts']}, at={p['finishedAt']})\" for p in pods))
|
||||
")"
|
||||
fi
|
||||
}
|
||||
|
||||
# Check LimitRange defaults in namespaces with OOM events
|
||||
check_limitranges() {
|
||||
if $DRY_RUN; then
|
||||
add_check "limitranges" "ok" "DRY RUN: would check LimitRange defaults"
|
||||
return
|
||||
fi
|
||||
|
||||
local namespaces
|
||||
namespaces=$($KUBECTL get pods --all-namespaces -o json | python3 -c "
|
||||
import sys, json
|
||||
data = json.load(sys.stdin)
|
||||
ns_set = set()
|
||||
for pod in data.get('items', []):
|
||||
for cs in pod.get('status', {}).get('containerStatuses', []) + pod.get('status', {}).get('initContainerStatuses', []):
|
||||
for state in [cs.get('lastState', {}).get('terminated', {}), cs.get('state', {}).get('terminated', {})]:
|
||||
if state.get('reason') == 'OOMKilled':
|
||||
ns_set.add(pod['metadata']['namespace'])
|
||||
for ns in sorted(ns_set):
|
||||
print(ns)
|
||||
" 2>/dev/null) || namespaces=""
|
||||
|
||||
if [ -z "$namespaces" ]; then
|
||||
add_check "limitranges" "ok" "No namespaces with OOMKilled pods to check"
|
||||
return
|
||||
fi
|
||||
|
||||
local lr_info=""
|
||||
while IFS= read -r ns; do
|
||||
local lr
|
||||
lr=$($KUBECTL get limitrange -n "$ns" -o json 2>/dev/null | python3 -c "
|
||||
import sys, json
|
||||
data = json.load(sys.stdin)
|
||||
for item in data.get('items', []):
|
||||
for limit in item.get('spec', {}).get('limits', []):
|
||||
if limit.get('type') == 'Container':
|
||||
default_mem = limit.get('default', {}).get('memory', 'none')
|
||||
default_cpu = limit.get('default', {}).get('cpu', 'none')
|
||||
print(f'$ns: default memory={default_mem}, cpu={default_cpu}')
|
||||
" 2>/dev/null) || lr=""
|
||||
if [ -n "$lr" ]; then
|
||||
lr_info="${lr_info}${lr}; "
|
||||
else
|
||||
lr_info="${lr_info}${ns}: no LimitRange; "
|
||||
fi
|
||||
done <<< "$namespaces"
|
||||
|
||||
add_check "limitranges" "warn" "LimitRange defaults for OOM namespaces: ${lr_info}"
|
||||
}
|
||||
|
||||
# Check VPA recommendations from Goldilocks
|
||||
check_vpa_recommendations() {
|
||||
if $DRY_RUN; then
|
||||
add_check "vpa-recommendations" "ok" "DRY RUN: would check VPA recommendations"
|
||||
return
|
||||
fi
|
||||
|
||||
local vpa_count
|
||||
vpa_count=$($KUBECTL get vpa --all-namespaces --no-headers 2>/dev/null | wc -l | tr -d ' ') || vpa_count=0
|
||||
|
||||
if [ "$vpa_count" -eq 0 ]; then
|
||||
add_check "vpa-recommendations" "warn" "No VPA objects found — Goldilocks may not be deployed"
|
||||
return
|
||||
fi
|
||||
|
||||
local vpa_recs
|
||||
vpa_recs=$($KUBECTL get vpa --all-namespaces -o json 2>/dev/null | python3 -c "
|
||||
import sys, json
|
||||
data = json.load(sys.stdin)
|
||||
recs = []
|
||||
for vpa in data.get('items', []):
|
||||
ns = vpa['metadata']['namespace']
|
||||
name = vpa['metadata']['name']
|
||||
for cr in vpa.get('status', {}).get('recommendation', {}).get('containerRecommendations', []):
|
||||
container = cr.get('containerName', 'unknown')
|
||||
target_mem = cr.get('target', {}).get('memory', 'n/a')
|
||||
target_cpu = cr.get('target', {}).get('cpu', 'n/a')
|
||||
upper_mem = cr.get('upperBound', {}).get('memory', 'n/a')
|
||||
recs.append(f'{ns}/{name}:{container} target_mem={target_mem} target_cpu={target_cpu} upper_mem={upper_mem}')
|
||||
if recs:
|
||||
print('; '.join(recs[:20]))
|
||||
else:
|
||||
print('No recommendations available yet')
|
||||
" 2>/dev/null) || vpa_recs="Failed to read VPA recommendations"
|
||||
|
||||
add_check "vpa-recommendations" "ok" "$vpa_recs"
|
||||
}
|
||||
|
||||
# Check resource requests/limits on OOMKilled pods
|
||||
check_pod_resources() {
|
||||
if $DRY_RUN; then
|
||||
add_check "pod-resources" "ok" "DRY RUN: would check pod resource specs"
|
||||
return
|
||||
fi
|
||||
|
||||
local resources
|
||||
resources=$($KUBECTL get pods --all-namespaces -o json | python3 -c "
|
||||
import sys, json
|
||||
data = json.load(sys.stdin)
|
||||
results = []
|
||||
for pod in data.get('items', []):
|
||||
ns = pod['metadata']['namespace']
|
||||
name = pod['metadata']['name']
|
||||
has_oom = False
|
||||
for cs in pod.get('status', {}).get('containerStatuses', []) + pod.get('status', {}).get('initContainerStatuses', []):
|
||||
for state in [cs.get('lastState', {}).get('terminated', {}), cs.get('state', {}).get('terminated', {})]:
|
||||
if state.get('reason') == 'OOMKilled':
|
||||
has_oom = True
|
||||
break
|
||||
if has_oom:
|
||||
for c in pod.get('spec', {}).get('containers', []) + pod.get('spec', {}).get('initContainers', []):
|
||||
req_mem = c.get('resources', {}).get('requests', {}).get('memory', 'none')
|
||||
lim_mem = c.get('resources', {}).get('limits', {}).get('memory', 'none')
|
||||
req_cpu = c.get('resources', {}).get('requests', {}).get('cpu', 'none')
|
||||
lim_cpu = c.get('resources', {}).get('limits', {}).get('cpu', 'none')
|
||||
results.append(f\"{ns}/{name}:{c['name']} req_mem={req_mem} lim_mem={lim_mem} req_cpu={req_cpu} lim_cpu={lim_cpu}\")
|
||||
if results:
|
||||
print('; '.join(results))
|
||||
else:
|
||||
print('No OOMKilled pods to inspect')
|
||||
" 2>/dev/null) || resources="Failed to check pod resources"
|
||||
|
||||
if echo "$resources" | grep -q "No OOMKilled"; then
|
||||
add_check "pod-resources" "ok" "$resources"
|
||||
else
|
||||
add_check "pod-resources" "warn" "$resources"
|
||||
fi
|
||||
}
|
||||
|
||||
# Run all checks
|
||||
find_oomkilled
|
||||
check_limitranges
|
||||
check_vpa_recommendations
|
||||
check_pod_resources
|
||||
|
||||
# Determine overall status
|
||||
OVERALL=$(echo "$CHECKS" | python3 -c "
|
||||
import sys, json
|
||||
checks = json.load(sys.stdin)
|
||||
statuses = [c['status'] for c in checks]
|
||||
if 'fail' in statuses:
|
||||
print('fail')
|
||||
elif 'warn' in statuses:
|
||||
print('warn')
|
||||
else:
|
||||
print('ok')
|
||||
")
|
||||
|
||||
echo "{\"status\": \"$OVERALL\", \"agent\": \"$AGENT\", \"checks\": $CHECKS}" | python3 -m json.tool
|
||||
|
|
@ -1,260 +0,0 @@
|
|||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
AGENT="platform-status"
|
||||
KUBECTL="kubectl --kubeconfig /Users/viktorbarzin/code/infra/config"
|
||||
PROXMOX_HOST="root@192.168.1.127"
|
||||
REGISTRY_HOST="10.0.20.10"
|
||||
DRY_RUN=false
|
||||
|
||||
for arg in "$@"; do
|
||||
case "$arg" in
|
||||
--dry-run) DRY_RUN=true ;;
|
||||
esac
|
||||
done
|
||||
|
||||
checks=()
|
||||
|
||||
add_check() {
|
||||
local name="$1" status="$2" message="$3"
|
||||
checks+=("{\"name\": \"$name\", \"status\": \"$status\", \"message\": \"$message\"}")
|
||||
}
|
||||
|
||||
check_traefik() {
|
||||
if $DRY_RUN; then
|
||||
add_check "traefik" "ok" "dry-run: would check Traefik status"
|
||||
return
|
||||
fi
|
||||
|
||||
# Discover Traefik pods via labels
|
||||
local traefik_pod
|
||||
traefik_pod=$($KUBECTL get pods -n traefik -l app.kubernetes.io/name=traefik -o name 2>/dev/null | head -1)
|
||||
if [ -z "$traefik_pod" ]; then
|
||||
traefik_pod=$($KUBECTL get pods -n traefik -l app=traefik -o name 2>/dev/null | head -1)
|
||||
fi
|
||||
|
||||
if [ -z "$traefik_pod" ]; then
|
||||
add_check "traefik" "fail" "No Traefik pods found in traefik namespace"
|
||||
return
|
||||
fi
|
||||
|
||||
local phase
|
||||
phase=$($KUBECTL get "$traefik_pod" -n traefik -o jsonpath='{.status.phase}' 2>/dev/null)
|
||||
if [ "$phase" = "Running" ]; then
|
||||
# Check IngressRoute count
|
||||
local ir_count
|
||||
ir_count=$($KUBECTL get ingressroute --all-namespaces --no-headers 2>/dev/null | wc -l | tr -d ' ')
|
||||
add_check "traefik" "ok" "Traefik running, $ir_count IngressRoutes configured"
|
||||
else
|
||||
add_check "traefik" "fail" "Traefik pod phase: $phase"
|
||||
fi
|
||||
|
||||
# Check for IngressRoutes with errors (TLS or service issues)
|
||||
local ir_errors
|
||||
ir_errors=$($KUBECTL get events --all-namespaces --field-selector reason=IngressRouteError --no-headers 2>/dev/null | wc -l | tr -d ' ')
|
||||
if [ "$ir_errors" -gt 0 ]; then
|
||||
add_check "traefik-ingressroutes" "warn" "$ir_errors IngressRoute error events found"
|
||||
fi
|
||||
}
|
||||
|
||||
check_kyverno() {
|
||||
if $DRY_RUN; then
|
||||
add_check "kyverno" "ok" "dry-run: would check Kyverno status"
|
||||
return
|
||||
fi
|
||||
|
||||
# Discover Kyverno pods via labels
|
||||
local kyverno_pods
|
||||
kyverno_pods=$($KUBECTL get pods -n kyverno -l app.kubernetes.io/name=kyverno -o name 2>/dev/null)
|
||||
if [ -z "$kyverno_pods" ]; then
|
||||
kyverno_pods=$($KUBECTL get pods -n kyverno -l app=kyverno -o name 2>/dev/null)
|
||||
fi
|
||||
|
||||
if [ -z "$kyverno_pods" ]; then
|
||||
add_check "kyverno" "warn" "No Kyverno pods found"
|
||||
return
|
||||
fi
|
||||
|
||||
local total=0 ready=0
|
||||
while IFS= read -r pod; do
|
||||
[ -z "$pod" ] && continue
|
||||
total=$((total + 1))
|
||||
local phase
|
||||
phase=$($KUBECTL get "$pod" -n kyverno -o jsonpath='{.status.phase}' 2>/dev/null)
|
||||
[ "$phase" = "Running" ] && ready=$((ready + 1))
|
||||
done <<< "$kyverno_pods"
|
||||
|
||||
if [ "$ready" -eq "$total" ]; then
|
||||
# Check policy count
|
||||
local policy_count
|
||||
policy_count=$($KUBECTL get clusterpolicy --no-headers 2>/dev/null | wc -l | tr -d ' ')
|
||||
add_check "kyverno" "ok" "$ready/$total Kyverno pods running, $policy_count ClusterPolicies"
|
||||
else
|
||||
add_check "kyverno" "warn" "$ready/$total Kyverno pods running"
|
||||
fi
|
||||
|
||||
# Check for policy violations
|
||||
local violations
|
||||
violations=$($KUBECTL get policyreport --all-namespaces -o json 2>/dev/null | \
|
||||
python3 -c "
|
||||
import sys, json
|
||||
try:
|
||||
data = json.load(sys.stdin)
|
||||
fail_count = sum(r.get('summary',{}).get('fail',0) for r in data.get('items',[]))
|
||||
print(fail_count)
|
||||
except: print('0')
|
||||
" 2>/dev/null || echo "0")
|
||||
|
||||
if [ "$violations" -gt 0 ]; then
|
||||
add_check "kyverno-violations" "warn" "$violations policy violations across namespaces"
|
||||
fi
|
||||
}
|
||||
|
||||
check_vpa_goldilocks() {
|
||||
if $DRY_RUN; then
|
||||
add_check "vpa-goldilocks" "ok" "dry-run: would check VPA/Goldilocks status"
|
||||
return
|
||||
fi
|
||||
|
||||
# Check VPA admission controller
|
||||
local vpa_pods
|
||||
vpa_pods=$($KUBECTL get pods -n goldilocks -l app.kubernetes.io/name=goldilocks -o name 2>/dev/null)
|
||||
if [ -z "$vpa_pods" ]; then
|
||||
vpa_pods=$($KUBECTL get pods -n goldilocks -o name 2>/dev/null)
|
||||
fi
|
||||
|
||||
if [ -z "$vpa_pods" ]; then
|
||||
add_check "vpa-goldilocks" "warn" "No Goldilocks pods found"
|
||||
return
|
||||
fi
|
||||
|
||||
local total=0 ready=0
|
||||
while IFS= read -r pod; do
|
||||
[ -z "$pod" ] && continue
|
||||
total=$((total + 1))
|
||||
local phase
|
||||
phase=$($KUBECTL get "$pod" -n goldilocks -o jsonpath='{.status.phase}' 2>/dev/null)
|
||||
[ "$phase" = "Running" ] && ready=$((ready + 1))
|
||||
done <<< "$vpa_pods"
|
||||
|
||||
if [ "$ready" -eq "$total" ]; then
|
||||
local vpa_count
|
||||
vpa_count=$($KUBECTL get vpa --all-namespaces --no-headers 2>/dev/null | wc -l | tr -d ' ')
|
||||
add_check "vpa-goldilocks" "ok" "$ready/$total Goldilocks pods running, $vpa_count VPAs configured"
|
||||
else
|
||||
add_check "vpa-goldilocks" "warn" "$ready/$total Goldilocks pods running"
|
||||
fi
|
||||
|
||||
# Check for VPAs with unexpected updateMode
|
||||
local auto_vpas
|
||||
auto_vpas=$($KUBECTL get vpa --all-namespaces -o json 2>/dev/null | \
|
||||
python3 -c "
|
||||
import sys, json
|
||||
try:
|
||||
data = json.load(sys.stdin)
|
||||
auto = [i['metadata']['name'] for i in data.get('items',[]) if i.get('spec',{}).get('updatePolicy',{}).get('updateMode','') == 'Auto']
|
||||
print(len(auto))
|
||||
except: print('0')
|
||||
" 2>/dev/null || echo "0")
|
||||
|
||||
if [ "$auto_vpas" -gt 0 ]; then
|
||||
add_check "vpa-auto-mode" "warn" "$auto_vpas VPAs set to Auto updateMode (may cause unexpected restarts)"
|
||||
fi
|
||||
}
|
||||
|
||||
check_pull_through_cache() {
|
||||
if $DRY_RUN; then
|
||||
add_check "pull-through-cache" "ok" "dry-run: would check pull-through cache at $REGISTRY_HOST"
|
||||
return
|
||||
fi
|
||||
|
||||
if timeout 5 curl -sf "http://${REGISTRY_HOST}:5000/v2/" &>/dev/null; then
|
||||
add_check "pull-through-cache" "ok" "Pull-through cache registry at $REGISTRY_HOST:5000 is healthy"
|
||||
elif timeout 5 curl -sf "https://${REGISTRY_HOST}/v2/" &>/dev/null; then
|
||||
add_check "pull-through-cache" "ok" "Pull-through cache registry at $REGISTRY_HOST is healthy (HTTPS)"
|
||||
else
|
||||
add_check "pull-through-cache" "fail" "Pull-through cache registry at $REGISTRY_HOST is unreachable"
|
||||
fi
|
||||
}
|
||||
|
||||
check_proxmox() {
|
||||
if $DRY_RUN; then
|
||||
add_check "proxmox" "ok" "dry-run: would check Proxmox host resources"
|
||||
return
|
||||
fi
|
||||
|
||||
local cpu_load
|
||||
if cpu_load=$(timeout 10 ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no "$PROXMOX_HOST" \
|
||||
"uptime | awk -F'load average:' '{print \$2}' | awk -F, '{print \$1}' | tr -d ' '" 2>/dev/null); then
|
||||
local cpu_count
|
||||
cpu_count=$(timeout 10 ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no "$PROXMOX_HOST" \
|
||||
"nproc" 2>/dev/null || echo "1")
|
||||
|
||||
# Check memory
|
||||
local mem_info
|
||||
mem_info=$(timeout 10 ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no "$PROXMOX_HOST" \
|
||||
"free -m | awk '/Mem:/{printf \"%d/%dMB (%.0f%%)\", \$3, \$2, \$3/\$2*100}'" 2>/dev/null || echo "unknown")
|
||||
|
||||
add_check "proxmox" "ok" "Proxmox host: load=$cpu_load (${cpu_count}cores), mem=$mem_info"
|
||||
else
|
||||
add_check "proxmox" "fail" "Could not reach Proxmox host via SSH"
|
||||
fi
|
||||
}
|
||||
|
||||
check_metallb() {
|
||||
if $DRY_RUN; then
|
||||
add_check "metallb" "ok" "dry-run: would check MetalLB status"
|
||||
return
|
||||
fi
|
||||
|
||||
local metallb_pods
|
||||
metallb_pods=$($KUBECTL get pods -n metallb-system -l app.kubernetes.io/name=metallb -o name 2>/dev/null)
|
||||
if [ -z "$metallb_pods" ]; then
|
||||
metallb_pods=$($KUBECTL get pods -n metallb-system -o name 2>/dev/null)
|
||||
fi
|
||||
|
||||
if [ -z "$metallb_pods" ]; then
|
||||
add_check "metallb" "warn" "No MetalLB pods found"
|
||||
return
|
||||
fi
|
||||
|
||||
local total=0 ready=0
|
||||
while IFS= read -r pod; do
|
||||
[ -z "$pod" ] && continue
|
||||
total=$((total + 1))
|
||||
local phase
|
||||
phase=$($KUBECTL get "$pod" -n metallb-system -o jsonpath='{.status.phase}' 2>/dev/null)
|
||||
[ "$phase" = "Running" ] && ready=$((ready + 1))
|
||||
done <<< "$metallb_pods"
|
||||
|
||||
if [ "$ready" -eq "$total" ]; then
|
||||
add_check "metallb" "ok" "$ready/$total MetalLB pods running"
|
||||
else
|
||||
add_check "metallb" "warn" "$ready/$total MetalLB pods running"
|
||||
fi
|
||||
}
|
||||
|
||||
# Run checks
|
||||
check_traefik
|
||||
check_kyverno
|
||||
check_vpa_goldilocks
|
||||
check_pull_through_cache
|
||||
check_proxmox
|
||||
check_metallb
|
||||
|
||||
# Determine overall status
|
||||
overall="ok"
|
||||
for c in "${checks[@]}"; do
|
||||
if echo "$c" | grep -q '"status": "fail"'; then
|
||||
overall="fail"
|
||||
break
|
||||
elif echo "$c" | grep -q '"status": "warn"'; then
|
||||
overall="warn"
|
||||
fi
|
||||
done
|
||||
|
||||
# Output JSON
|
||||
checks_json=$(IFS=,; echo "${checks[*]}")
|
||||
cat <<EOF
|
||||
{"status": "$overall", "agent": "$AGENT", "checks": [$checks_json]}
|
||||
EOF
|
||||
|
|
@ -1,190 +0,0 @@
|
|||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
KUBECTL="kubectl --kubeconfig /Users/viktorbarzin/code/infra/config"
|
||||
DRY_RUN=false
|
||||
AGENT="resource-report"
|
||||
|
||||
for arg in "$@"; do
|
||||
case "$arg" in
|
||||
--dry-run) DRY_RUN=true ;;
|
||||
esac
|
||||
done
|
||||
|
||||
CHECKS="[]"
|
||||
|
||||
add_check() {
|
||||
local name="$1" status="$2" message="$3"
|
||||
CHECKS=$(echo "$CHECKS" | python3 -c "
|
||||
import sys, json
|
||||
checks = json.load(sys.stdin)
|
||||
checks.append({'name': '''$name''', 'status': '''$status''', 'message': '''$message'''})
|
||||
json.dump(checks, sys.stdout)
|
||||
")
|
||||
}
|
||||
|
||||
# Node capacity report: allocatable vs requests vs limits
|
||||
check_node_capacity() {
|
||||
if $DRY_RUN; then
|
||||
add_check "node-capacity" "ok" "DRY RUN: would report node allocatable vs requests vs limits"
|
||||
return
|
||||
fi
|
||||
|
||||
local report
|
||||
report=$($KUBECTL get nodes -o json | python3 -c "
|
||||
import sys, json
|
||||
|
||||
def parse_cpu(val):
|
||||
if val.endswith('m'):
|
||||
return int(val[:-1])
|
||||
return int(float(val) * 1000)
|
||||
|
||||
def parse_mem(val):
|
||||
units = {'Ki': 1024, 'Mi': 1024**2, 'Gi': 1024**3, 'Ti': 1024**4}
|
||||
for suffix, mult in units.items():
|
||||
if val.endswith(suffix):
|
||||
return int(float(val[:-len(suffix)]) * mult)
|
||||
return int(val)
|
||||
|
||||
def fmt_mem(b):
|
||||
return f'{b / (1024**3):.1f}Gi'
|
||||
|
||||
def fmt_cpu(m):
|
||||
return f'{m}m'
|
||||
|
||||
data = json.load(sys.stdin)
|
||||
nodes = []
|
||||
for node in data.get('items', []):
|
||||
name = node['metadata']['name']
|
||||
alloc = node.get('status', {}).get('allocatable', {})
|
||||
cpu_alloc = parse_cpu(alloc.get('cpu', '0'))
|
||||
mem_alloc = parse_mem(alloc.get('memory', '0'))
|
||||
nodes.append({'name': name, 'cpu_alloc': cpu_alloc, 'mem_alloc': mem_alloc})
|
||||
|
||||
for n in nodes:
|
||||
print(f\"{n['name']}: cpu_alloc={fmt_cpu(n['cpu_alloc'])} mem_alloc={fmt_mem(n['mem_alloc'])}\")
|
||||
" 2>/dev/null) || report="Failed to get node capacity"
|
||||
|
||||
# Get requests/limits per node
|
||||
local usage
|
||||
usage=$($KUBECTL get pods --all-namespaces -o json | python3 -c "
|
||||
import sys, json
|
||||
|
||||
def parse_cpu(val):
|
||||
if not val: return 0
|
||||
if val.endswith('m'):
|
||||
return int(val[:-1])
|
||||
return int(float(val) * 1000)
|
||||
|
||||
def parse_mem(val):
|
||||
if not val: return 0
|
||||
units = {'Ki': 1024, 'Mi': 1024**2, 'Gi': 1024**3, 'Ti': 1024**4}
|
||||
for suffix, mult in units.items():
|
||||
if val.endswith(suffix):
|
||||
return int(float(val[:-len(suffix)]) * mult)
|
||||
return int(val)
|
||||
|
||||
def fmt_mem(b):
|
||||
return f'{b / (1024**3):.1f}Gi'
|
||||
|
||||
def fmt_cpu(m):
|
||||
return f'{m}m'
|
||||
|
||||
data = json.load(sys.stdin)
|
||||
per_node = {}
|
||||
for pod in data.get('items', []):
|
||||
phase = pod.get('status', {}).get('phase', '')
|
||||
if phase not in ('Running', 'Pending'):
|
||||
continue
|
||||
node = pod.get('spec', {}).get('nodeName', 'unscheduled')
|
||||
if node not in per_node:
|
||||
per_node[node] = {'cpu_req': 0, 'cpu_lim': 0, 'mem_req': 0, 'mem_lim': 0}
|
||||
for c in pod.get('spec', {}).get('containers', []) + pod.get('spec', {}).get('initContainers', []):
|
||||
res = c.get('resources', {})
|
||||
per_node[node]['cpu_req'] += parse_cpu(res.get('requests', {}).get('cpu', ''))
|
||||
per_node[node]['cpu_lim'] += parse_cpu(res.get('limits', {}).get('cpu', ''))
|
||||
per_node[node]['mem_req'] += parse_mem(res.get('requests', {}).get('memory', ''))
|
||||
per_node[node]['mem_lim'] += parse_mem(res.get('limits', {}).get('memory', ''))
|
||||
|
||||
for node in sorted(per_node.keys()):
|
||||
n = per_node[node]
|
||||
print(f\"{node}: cpu_req={fmt_cpu(n['cpu_req'])} cpu_lim={fmt_cpu(n['cpu_lim'])} mem_req={fmt_mem(n['mem_req'])} mem_lim={fmt_mem(n['mem_lim'])}\")
|
||||
" 2>/dev/null) || usage="Failed to get pod resource usage"
|
||||
|
||||
add_check "node-capacity" "ok" "Allocatable: ${report} | Usage: ${usage}"
|
||||
}
|
||||
|
||||
# Per-namespace ResourceQuota usage
|
||||
check_resource_quotas() {
|
||||
if $DRY_RUN; then
|
||||
add_check "resource-quotas" "ok" "DRY RUN: would check ResourceQuota usage per namespace"
|
||||
return
|
||||
fi
|
||||
|
||||
local quota_count
|
||||
quota_count=$($KUBECTL get resourcequota --all-namespaces --no-headers 2>/dev/null | wc -l | tr -d ' ') || quota_count=0
|
||||
|
||||
if [ "$quota_count" -eq 0 ]; then
|
||||
add_check "resource-quotas" "ok" "No ResourceQuotas defined in the cluster"
|
||||
return
|
||||
fi
|
||||
|
||||
local quota_report
|
||||
quota_report=$($KUBECTL get resourcequota --all-namespaces -o json 2>/dev/null | python3 -c "
|
||||
import sys, json
|
||||
data = json.load(sys.stdin)
|
||||
results = []
|
||||
for rq in data.get('items', []):
|
||||
ns = rq['metadata']['namespace']
|
||||
name = rq['metadata']['name']
|
||||
hard = rq.get('status', {}).get('hard', {})
|
||||
used = rq.get('status', {}).get('used', {})
|
||||
for resource in hard:
|
||||
h = hard[resource]
|
||||
u = used.get(resource, '0')
|
||||
results.append(f'{ns}/{name}: {resource} used={u} hard={h}')
|
||||
if results:
|
||||
print('; '.join(results[:30]))
|
||||
else:
|
||||
print('No quota usage data')
|
||||
" 2>/dev/null) || quota_report="Failed to read ResourceQuotas"
|
||||
|
||||
add_check "resource-quotas" "ok" "$quota_report"
|
||||
}
|
||||
|
||||
# Top pods by memory usage
|
||||
check_top_consumers() {
|
||||
if $DRY_RUN; then
|
||||
add_check "top-consumers" "ok" "DRY RUN: would report top memory-consuming pods"
|
||||
return
|
||||
fi
|
||||
|
||||
local top_pods
|
||||
top_pods=$($KUBECTL top pods --all-namespaces --no-headers 2>/dev/null | sort -k4 -h -r | head -10 | awk '{print $1"/"$2": cpu="$3" mem="$4}' | tr '\n' '; ') || top_pods="metrics-server may not be available"
|
||||
|
||||
if [ -z "$top_pods" ]; then
|
||||
add_check "top-consumers" "warn" "kubectl top returned no data — metrics-server may not be running"
|
||||
else
|
||||
add_check "top-consumers" "ok" "Top 10 by memory: ${top_pods}"
|
||||
fi
|
||||
}
|
||||
|
||||
# Run all checks
|
||||
check_node_capacity
|
||||
check_resource_quotas
|
||||
check_top_consumers
|
||||
|
||||
# Determine overall status
|
||||
OVERALL=$(echo "$CHECKS" | python3 -c "
|
||||
import sys, json
|
||||
checks = json.load(sys.stdin)
|
||||
statuses = [c['status'] for c in checks]
|
||||
if 'fail' in statuses:
|
||||
print('fail')
|
||||
elif 'warn' in statuses:
|
||||
print('warn')
|
||||
else:
|
||||
print('ok')
|
||||
")
|
||||
|
||||
echo "{\"status\": \"$OVERALL\", \"agent\": \"$AGENT\", \"checks\": $CHECKS}" | python3 -m json.tool
|
||||
|
|
@ -1,95 +0,0 @@
|
|||
#!/usr/bin/env bash
|
||||
# sev-context.sh — Gather structured cluster context for post-mortem triage
|
||||
# Used by sev-triage agent and available to all pipeline stages
|
||||
set -euo pipefail
|
||||
|
||||
KUBECONFIG="${KUBECONFIG:-/Users/viktorbarzin/code/infra/config}"
|
||||
INFRA_DIR="${INFRA_DIR:-/Users/viktorbarzin/code/infra}"
|
||||
export KUBECONFIG
|
||||
|
||||
echo "=== NODE STATUS ==="
|
||||
kubectl get nodes -o custom-columns=\
|
||||
'NAME:.metadata.name,STATUS:.status.conditions[?(@.type=="Ready")].status,VERSION:.status.nodeInfo.kubeletVersion,CPU_CAP:.status.capacity.cpu,MEM_CAP:.status.capacity.memory' \
|
||||
--no-headers 2>/dev/null || echo "ERROR: Cannot reach cluster"
|
||||
|
||||
echo ""
|
||||
echo "=== UNHEALTHY PODS ==="
|
||||
# Pods not Running/Succeeded, with UTC start time instead of relative age
|
||||
kubectl get pods --all-namespaces \
|
||||
--field-selector='status.phase!=Running,status.phase!=Succeeded' \
|
||||
-o custom-columns=\
|
||||
'NAMESPACE:.metadata.namespace,POD:.metadata.name,STATUS:.status.phase,RESTARTS:.status.containerStatuses[0].restartCount,STARTED_UTC:.status.startTime,NODE:.spec.nodeName' \
|
||||
--no-headers 2>/dev/null || true
|
||||
|
||||
# Also show pods that are Running but have containers not ready or high restarts
|
||||
kubectl get pods --all-namespaces -o json 2>/dev/null | python3 -c "
|
||||
import json, sys
|
||||
try:
|
||||
data = json.load(sys.stdin)
|
||||
except:
|
||||
sys.exit(0)
|
||||
for pod in data.get('items', []):
|
||||
ns = pod['metadata']['namespace']
|
||||
name = pod['metadata']['name']
|
||||
node = pod['spec'].get('nodeName', 'N/A')
|
||||
start = pod['status'].get('startTime', 'N/A')
|
||||
phase = pod['status'].get('phase', 'Unknown')
|
||||
if phase != 'Running':
|
||||
continue
|
||||
for cs in pod['status'].get('containerStatuses', []):
|
||||
restarts = cs.get('restartCount', 0)
|
||||
ready = cs.get('ready', True)
|
||||
if restarts > 3 or not ready:
|
||||
reason = ''
|
||||
waiting = cs.get('state', {}).get('waiting', {})
|
||||
if waiting:
|
||||
reason = waiting.get('reason', '')
|
||||
print(f'{ns}\t{name}\t{phase}/NotReady\t{restarts}\t{start}\t{node}\t{reason}')
|
||||
break
|
||||
" 2>/dev/null || true
|
||||
|
||||
echo ""
|
||||
echo "=== RECENT EVENTS (last 2h, Warning/Error only) ==="
|
||||
kubectl get events --all-namespaces \
|
||||
--field-selector='type!=Normal' \
|
||||
--sort-by='.lastTimestamp' \
|
||||
-o custom-columns=\
|
||||
'NAMESPACE:.metadata.namespace,TYPE:.type,REASON:.reason,OBJECT:.involvedObject.name,LAST_SEEN_UTC:.lastTimestamp,MESSAGE:.message' \
|
||||
--no-headers 2>/dev/null | tail -50 || true
|
||||
|
||||
echo ""
|
||||
echo "=== NAMESPACE TO STACK MAPPING ==="
|
||||
# Parse terragrunt.hcl files to map k8s namespaces to stack directories
|
||||
for tg in "$INFRA_DIR"/stacks/*/terragrunt.hcl; do
|
||||
stack_dir=$(dirname "$tg")
|
||||
stack_name=$(basename "$stack_dir")
|
||||
# Try to find namespace from the stack - check main.tf for namespace references
|
||||
ns=$(grep -h 'namespace' "$stack_dir"/main.tf 2>/dev/null | grep -oP '"\K[a-z0-9-]+(?=")' | head -1 || echo "$stack_name")
|
||||
echo "$ns → stacks/$stack_name"
|
||||
done 2>/dev/null | sort -u || true
|
||||
|
||||
echo ""
|
||||
echo "=== SERVICE TIERS ==="
|
||||
# Parse service-catalog.md for tier classifications
|
||||
catalog="$INFRA_DIR/.claude/reference/service-catalog.md"
|
||||
if [ -f "$catalog" ]; then
|
||||
current_tier=""
|
||||
while IFS= read -r line; do
|
||||
case "$line" in
|
||||
*"Tier: core"*) current_tier="core" ;;
|
||||
*"Tier: cluster"*) current_tier="cluster" ;;
|
||||
*"Admin"*) current_tier="admin" ;;
|
||||
*"Active Use"*) current_tier="active" ;;
|
||||
*"Optional"*|*"Inactive"*) current_tier="optional" ;;
|
||||
esac
|
||||
if [[ "$line" =~ ^\|[[:space:]]+([a-z0-9_-]+)[[:space:]]+\| && "$current_tier" != "" ]]; then
|
||||
svc="${BASH_REMATCH[1]}"
|
||||
[[ "$svc" == "Service" || "$svc" == "---" ]] && continue
|
||||
echo "$svc=$current_tier"
|
||||
fi
|
||||
done < "$catalog"
|
||||
fi
|
||||
|
||||
echo ""
|
||||
echo "=== CURRENT UTC TIME ==="
|
||||
date -u '+%Y-%m-%dT%H:%M:%SZ'
|
||||
|
|
@ -1,143 +0,0 @@
|
|||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
KUBECTL="kubectl --kubeconfig /Users/viktorbarzin/code/infra/config"
|
||||
AGENT="tls-check"
|
||||
DRY_RUN=false
|
||||
WARN_DAYS=14
|
||||
|
||||
for arg in "$@"; do
|
||||
case "$arg" in
|
||||
--dry-run) DRY_RUN=true ;;
|
||||
esac
|
||||
done
|
||||
|
||||
checks=()
|
||||
|
||||
add_check() {
|
||||
local name="$1" status="$2" message="$3"
|
||||
checks+=("{\"name\": \"$name\", \"status\": \"$status\", \"message\": \"$message\"}")
|
||||
}
|
||||
|
||||
check_tls_secrets() {
|
||||
if $DRY_RUN; then
|
||||
add_check "tls-secrets" "ok" "dry-run: would scan all kubernetes.io/tls secrets for expiry"
|
||||
return
|
||||
fi
|
||||
|
||||
local secrets_json
|
||||
secrets_json=$($KUBECTL get secrets -A -o json 2>/dev/null) || {
|
||||
add_check "tls-secrets" "fail" "Failed to list secrets"
|
||||
return
|
||||
}
|
||||
|
||||
local tls_secrets
|
||||
tls_secrets=$(echo "$secrets_json" | jq -r '.items[] | select(.type=="kubernetes.io/tls") | "\(.metadata.namespace)/\(.metadata.name)"' 2>/dev/null) || {
|
||||
add_check "tls-secrets" "fail" "Failed to parse secrets JSON"
|
||||
return
|
||||
}
|
||||
|
||||
if [ -z "$tls_secrets" ]; then
|
||||
add_check "tls-secrets" "warn" "No TLS secrets found"
|
||||
return
|
||||
fi
|
||||
|
||||
local total=0 expiring=0 expired=0 healthy=0 errors=0
|
||||
local now_epoch
|
||||
now_epoch=$(date +%s)
|
||||
local warn_epoch=$((now_epoch + WARN_DAYS * 86400))
|
||||
local expiring_list=""
|
||||
|
||||
while IFS= read -r secret; do
|
||||
total=$((total + 1))
|
||||
local ns="${secret%%/*}"
|
||||
local name="${secret##*/}"
|
||||
|
||||
local cert_pem
|
||||
cert_pem=$($KUBECTL get secret "$name" -n "$ns" -o jsonpath='{.data.tls\.crt}' 2>/dev/null | base64 -d 2>/dev/null) || {
|
||||
errors=$((errors + 1))
|
||||
continue
|
||||
}
|
||||
|
||||
local expiry_str
|
||||
expiry_str=$(echo "$cert_pem" | openssl x509 -noout -enddate 2>/dev/null | sed 's/notAfter=//') || {
|
||||
errors=$((errors + 1))
|
||||
continue
|
||||
}
|
||||
|
||||
local expiry_epoch
|
||||
expiry_epoch=$(date -j -f "%b %d %T %Y %Z" "$expiry_str" +%s 2>/dev/null || date -d "$expiry_str" +%s 2>/dev/null) || {
|
||||
errors=$((errors + 1))
|
||||
continue
|
||||
}
|
||||
|
||||
if [ "$expiry_epoch" -lt "$now_epoch" ]; then
|
||||
expired=$((expired + 1))
|
||||
expiring_list="${expiring_list}EXPIRED: ${ns}/${name}; "
|
||||
elif [ "$expiry_epoch" -lt "$warn_epoch" ]; then
|
||||
local days_left=$(( (expiry_epoch - now_epoch) / 86400 ))
|
||||
expiring=$((expiring + 1))
|
||||
expiring_list="${expiring_list}${days_left}d: ${ns}/${name}; "
|
||||
else
|
||||
healthy=$((healthy + 1))
|
||||
fi
|
||||
done <<< "$tls_secrets"
|
||||
|
||||
if [ "$expired" -gt 0 ]; then
|
||||
add_check "tls-secrets" "fail" "${expired} expired, ${expiring} expiring soon, ${healthy} healthy out of ${total} certs. ${expiring_list}"
|
||||
elif [ "$expiring" -gt 0 ]; then
|
||||
add_check "tls-secrets" "warn" "${expiring} expiring within ${WARN_DAYS}d, ${healthy} healthy out of ${total} certs. ${expiring_list}"
|
||||
else
|
||||
add_check "tls-secrets" "ok" "All ${healthy} TLS certs healthy (${errors} decode errors skipped)"
|
||||
fi
|
||||
}
|
||||
|
||||
check_cert_manager() {
|
||||
if $DRY_RUN; then
|
||||
add_check "cert-manager" "ok" "dry-run: would check cert-manager pod health and certificate CRDs"
|
||||
return
|
||||
fi
|
||||
|
||||
local cm_pods
|
||||
cm_pods=$($KUBECTL get pods -n cert-manager -l app.kubernetes.io/instance=cert-manager --no-headers 2>/dev/null) || {
|
||||
add_check "cert-manager" "fail" "Failed to query cert-manager pods"
|
||||
return
|
||||
}
|
||||
|
||||
local not_running
|
||||
not_running=$(echo "$cm_pods" | grep -v "Running" | grep -v "Completed" | grep -c "." 2>/dev/null || echo "0")
|
||||
|
||||
if [ "$not_running" -gt 0 ]; then
|
||||
add_check "cert-manager" "fail" "${not_running} cert-manager pod(s) not running"
|
||||
return
|
||||
fi
|
||||
|
||||
# Check for failed certificates
|
||||
local failed_certs
|
||||
failed_certs=$($KUBECTL get certificates -A -o json 2>/dev/null | jq -r '.items[] | select(.status.conditions[]? | select(.type=="Ready" and .status=="False")) | "\(.metadata.namespace)/\(.metadata.name)"' 2>/dev/null) || {
|
||||
add_check "cert-manager" "warn" "Could not query certificate CRDs"
|
||||
return
|
||||
}
|
||||
|
||||
if [ -n "$failed_certs" ]; then
|
||||
local count
|
||||
count=$(echo "$failed_certs" | wc -l | tr -d ' ')
|
||||
add_check "cert-manager" "warn" "${count} certificate(s) not ready: $(echo "$failed_certs" | head -5 | tr '\n' ', ')"
|
||||
else
|
||||
add_check "cert-manager" "ok" "cert-manager healthy, all certificates ready"
|
||||
fi
|
||||
}
|
||||
|
||||
check_tls_secrets
|
||||
check_cert_manager
|
||||
|
||||
# Output JSON
|
||||
overall="ok"
|
||||
for c in "${checks[@]}"; do
|
||||
s=$(echo "$c" | jq -r '.status')
|
||||
if [ "$s" = "fail" ]; then overall="fail"; break; fi
|
||||
if [ "$s" = "warn" ]; then overall="warn"; fi
|
||||
done
|
||||
|
||||
printf '{"status": "%s", "agent": "%s", "checks": [%s]}\n' \
|
||||
"$overall" "$AGENT" "$(IFS=,; echo "${checks[*]}")"
|
||||
|
|
@ -1,12 +0,0 @@
|
|||
{
|
||||
"project": {
|
||||
"name": "Home Infrastructure",
|
||||
"type": "terraform",
|
||||
"description": "Kubernetes cluster on Proxmox with self-hosted services"
|
||||
},
|
||||
"permissions": {
|
||||
"allow": [
|
||||
"Bash(ssh:*)"
|
||||
]
|
||||
}
|
||||
}
|
||||
|
|
@ -1,242 +0,0 @@
|
|||
---
|
||||
name: add-user
|
||||
description: |
|
||||
Add a new namespace-owner to the Kubernetes cluster. Use when:
|
||||
(1) "add user", "onboard user", "create user", "new namespace-owner",
|
||||
(2) someone new needs their own namespace and CI access,
|
||||
(3) user asks to set up cluster access for a person.
|
||||
Interactive: asks questions, updates Vault KV, applies stacks.
|
||||
---
|
||||
|
||||
# Add User
|
||||
|
||||
Add a new namespace-owner to the cluster. Two modes: **automated** (preferred) and **manual** (fallback).
|
||||
|
||||
SOPS state encryption access is **automatically provisioned** by the vault stack — per-stack Transit keys, policies, identity groups, and group aliases are all created from the `k8s_users` map. No manual SOPS setup required.
|
||||
|
||||
## Automated Flow (Preferred)
|
||||
|
||||
**Admin creates an Authentik invite → user signs up → provisioning happens automatically.**
|
||||
|
||||
### Steps
|
||||
|
||||
1. **Create Authentik Invitation**
|
||||
- Go to [Authentik Admin](https://authentik.viktorbarzin.me/if/admin/#/core/invitations)
|
||||
- Create a new invitation
|
||||
- Pre-assign the user to the **`kubernetes-namespace-owners`** group
|
||||
- Copy the invite link
|
||||
|
||||
2. **Send Invite Link to User**
|
||||
- The user clicks the link and signs up
|
||||
|
||||
3. **Automatic Provisioning (Vault KV + Authentik)**
|
||||
- Authentik fires a webhook to `webhook.viktorbarzin.me/authentik/provision`
|
||||
- The webhook handler validates the event and triggers the Woodpecker `provision-user` pipeline
|
||||
- Pipeline automatically:
|
||||
- Adds user to Vault KV (`secret/platform` → `k8s_users`) with convention defaults
|
||||
- Creates `sops-<username>` group in Authentik and assigns the user
|
||||
- Sends Slack notification with manual apply instructions
|
||||
|
||||
4. **Convention Defaults** (applied automatically)
|
||||
- Namespace: `username`
|
||||
- Quota: CPU 2, Memory 4Gi requests / 8Gi limits, 20 pods
|
||||
- Domains: none (user can request later)
|
||||
|
||||
5. **Manual Apply** (admin receives Slack notification)
|
||||
- The vault stack requires TLS certs (git-crypt) and can't run in CI. Apply manually:
|
||||
```bash
|
||||
cd /Users/viktorbarzin/code/infra
|
||||
cd stacks/vault && ../../scripts/tg apply --non-interactive && cd ../..
|
||||
cd stacks/rbac && ../../scripts/tg apply --non-interactive && cd ../..
|
||||
cd stacks/woodpecker && ../../scripts/tg apply --non-interactive && cd ../..
|
||||
```
|
||||
|
||||
6. **Post-Provisioning**
|
||||
- Send user the onboarding link: `https://k8s-portal.viktorbarzin.me/onboarding?role=namespace-owner`
|
||||
- If custom quota/domains needed, update Vault KV manually and re-apply stacks
|
||||
|
||||
### Monitoring the Pipeline
|
||||
|
||||
Watch the pipeline at: `https://ci.viktorbarzin.me` → infra repo → provision-user pipeline
|
||||
|
||||
## Manual Flow (Fallback)
|
||||
|
||||
Use when automated flow isn't available or custom configuration is needed.
|
||||
|
||||
### Step 1: Collect Information
|
||||
|
||||
Ask the user for ALL of the following before proceeding:
|
||||
|
||||
| Field | Question | Default |
|
||||
|-------|----------|---------|
|
||||
| `username` | Username (must match Forgejo username for CI) | — |
|
||||
| `email` | Email address (used for OIDC identity) | — |
|
||||
| `namespaces` | Namespace name(s) to create | `[username]` |
|
||||
| `domains` | Subdomain(s) under viktorbarzin.me for their apps | `[]` |
|
||||
| `cpu_requests` | CPU request quota | `"2"` |
|
||||
| `memory_requests` | Memory request quota | `"4Gi"` |
|
||||
| `memory_limits` | Memory limit quota | `"8Gi"` |
|
||||
| `pods` | Max pods | `"20"` |
|
||||
|
||||
Also confirm:
|
||||
- Has the user been added to the **`kubernetes-namespace-owners`** group in [Authentik](https://authentik.viktorbarzin.me)? (Manual step — admin must do this in the UI)
|
||||
- Has the user been added to the **`sops-USERNAME`** group in Authentik? (Required for terraform state decrypt — the vault stack creates the Vault external group, but the Authentik group must exist and the user must be in it)
|
||||
- Does the user need VPN access? If yes, also add to **`Headscale Users`** group in Authentik.
|
||||
|
||||
**Do NOT proceed until the Authentik group assignments are confirmed.**
|
||||
|
||||
### Step 2: Update Vault KV
|
||||
|
||||
Read the current `k8s_users` JSON from Vault, add the new entry, and write it back.
|
||||
|
||||
```bash
|
||||
# Ensure authenticated
|
||||
vault login -method=oidc
|
||||
|
||||
# Read current value
|
||||
vault kv get -format=json secret/platform | jq -r '.data.data.k8s_users' > /tmp/k8s_users.json
|
||||
|
||||
# Add the new user entry (use jq to merge)
|
||||
jq --arg user "USERNAME" \
|
||||
--arg email "EMAIL" \
|
||||
--argjson ns '["NAMESPACE"]' \
|
||||
--argjson domains '["DOMAIN1"]' \
|
||||
--argjson quota '{"cpu_requests":"2","memory_requests":"4Gi","memory_limits":"8Gi","pods":"20"}' \
|
||||
'. + {($user): {"role":"namespace-owner","email":$email,"namespaces":$ns,"domains":$domains,"quota":$quota}}' \
|
||||
/tmp/k8s_users.json > /tmp/k8s_users_updated.json
|
||||
|
||||
# Write back — must write the entire platform secret, not just k8s_users
|
||||
# First get all current keys
|
||||
vault kv get -format=json secret/platform | jq -r '.data.data' > /tmp/platform_secret.json
|
||||
|
||||
# Update k8s_users key with new JSON (as a string, since complex types are stored as JSON strings)
|
||||
jq --arg users "$(cat /tmp/k8s_users_updated.json)" '.k8s_users = $users' /tmp/platform_secret.json > /tmp/platform_updated.json
|
||||
|
||||
# Write back
|
||||
vault kv put secret/platform @/tmp/platform_updated.json
|
||||
|
||||
# Clean up
|
||||
rm -f /tmp/k8s_users.json /tmp/k8s_users_updated.json /tmp/platform_secret.json /tmp/platform_updated.json
|
||||
```
|
||||
|
||||
**Verify** the write:
|
||||
```bash
|
||||
vault kv get -field=k8s_users secret/platform | jq '.USERNAME'
|
||||
```
|
||||
|
||||
### Step 3: Apply Stacks
|
||||
|
||||
Apply in order. Use the `scripts/tg` wrapper.
|
||||
|
||||
```bash
|
||||
cd /Users/viktorbarzin/code/infra
|
||||
|
||||
# 1. Vault stack — creates namespace, Vault policy, identity entity, deployer role,
|
||||
# SOPS Transit key, SOPS policy, SOPS identity group + alias
|
||||
cd stacks/vault && ../../scripts/tg apply --non-interactive
|
||||
cd ../..
|
||||
|
||||
# 2. RBAC stack — creates RBAC bindings, ResourceQuota, TLS secret
|
||||
cd stacks/rbac && ../../scripts/tg apply --non-interactive
|
||||
cd ../..
|
||||
|
||||
# 3. Woodpecker stack — adds user to Woodpecker admin list
|
||||
cd stacks/woodpecker && ../../scripts/tg apply --non-interactive
|
||||
cd ../..
|
||||
```
|
||||
|
||||
### Step 4: Verify
|
||||
|
||||
```bash
|
||||
# Namespace exists
|
||||
kubectl get namespace USERNAME_NAMESPACE
|
||||
|
||||
# ResourceQuota applied
|
||||
kubectl describe resourcequota -n USERNAME_NAMESPACE
|
||||
|
||||
# Vault policy exists (namespace-owner + SOPS)
|
||||
vault policy read namespace-owner-USERNAME
|
||||
vault policy read sops-user-USERNAME
|
||||
|
||||
# Vault identity entity exists (with both policies)
|
||||
vault read identity/entity/name/USERNAME
|
||||
|
||||
# SOPS group exists
|
||||
vault read identity/group/name/sops-USERNAME
|
||||
|
||||
# K8s deployer role works
|
||||
vault write kubernetes/creds/NAMESPACE-deployer kubernetes_namespace=NAMESPACE
|
||||
|
||||
# SOPS Transit key exists
|
||||
vault read transit/keys/sops-state-NAMESPACE
|
||||
```
|
||||
|
||||
### Step 5: Notify User
|
||||
|
||||
Tell the user to share these onboarding instructions with the new user:
|
||||
- K8s Portal: `https://k8s-portal.viktorbarzin.me/onboarding?role=namespace-owner`
|
||||
- README: `https://github.com/ViktorBarzin/infra#new-user-onboarding`
|
||||
|
||||
**Web dashboard access** (auto-login, no token paste): the `rbac` stack
|
||||
auto-creates a `dashboard-<user>` SA + token for every namespace-owner
|
||||
(`dashboard-sa.tf`), and the **k8s-dashboard** stack's token-injector maps the
|
||||
user's Authentik identity → that token (`dashboard_injector.tf`, auto-derived
|
||||
from `k8s_users`). The new user just logs into `https://k8s.viktorbarzin.me` and
|
||||
lands in the dashboard scoped to their namespace (`admin` on their namespace +
|
||||
read-only on the namespace list & nodes for nav — no cross-tenant resource reads).
|
||||
|
||||
> **Apply order for a new namespace-owner:** after the vault/rbac/woodpecker
|
||||
> applies above, ALSO `cd stacks/k8s-dashboard && ../../scripts/tg apply` so the
|
||||
> injector map picks up the new user. (Manual token fallback:
|
||||
> `kubectl -n NAMESPACE get secret dashboard-USERNAME-token -o jsonpath='{.data.token}' | base64 -d`.)
|
||||
> Seamless OIDC SSO is built but blocked — see
|
||||
> `docs/plans/2026-06-04-k8s-dashboard-sso-design.md` §12.
|
||||
|
||||
> **Auto-login works only for the user's `k8s_users` HOME namespace.** The
|
||||
> dashboard injects the user's `dashboard-<user>` SA token, which the `rbac`
|
||||
> stack binds to `admin` on their home namespace only. If their workload lives
|
||||
> in a DIFFERENT / pre-existing namespace (e.g. gheorghe's app is in `novelapp`,
|
||||
> not his home `vabbit81`), that namespace's stack must ALSO grant their
|
||||
> **dashboard SA** — `kind: ServiceAccount, name: dashboard-<user>, namespace:
|
||||
> <home-ns>` — not just their OIDC `User` email (the dashboard uses the SA, and
|
||||
> apiserver OIDC is blocked). See `stacks/novelapp/main.tf` `novelapp_owner_vabbit81`
|
||||
> for the pattern (two subjects: User + SA). Best practice: set the user's
|
||||
> `k8s_users` namespace to where their workload actually runs, so the home-ns
|
||||
> auto-path covers them with no extra binding.
|
||||
|
||||
The user can decrypt their stack's state with:
|
||||
```bash
|
||||
vault login -method=oidc # authenticates via Authentik SSO
|
||||
scripts/state-sync decrypt NAMESPACE # decrypts only their stack
|
||||
```
|
||||
|
||||
## What Gets Auto-Generated
|
||||
|
||||
| Resource | Stack | Driven by |
|
||||
|----------|-------|-----------|
|
||||
| Kubernetes namespace | vault | `namespaces` list |
|
||||
| Vault policy (`namespace-owner-{user}`) | vault | user key |
|
||||
| Vault identity entity + OIDC alias | vault | user email |
|
||||
| K8s deployer Role + Vault K8s role | vault | `namespaces` list |
|
||||
| **SOPS Transit key** (`sops-state-{ns}`) | vault | `namespaces` list |
|
||||
| **SOPS Vault policy** (`sops-user-{user}`) | vault | user key + namespaces |
|
||||
| **SOPS identity group** (`sops-{user}`) | vault | user key |
|
||||
| **SOPS group alias** (maps Authentik group) | vault | user key |
|
||||
| RBAC RoleBinding (namespace admin) | rbac | `namespaces` list |
|
||||
| RBAC ClusterRoleBinding (cluster read-only) | rbac | user role |
|
||||
| ResourceQuota | rbac | `quota` object |
|
||||
| TLS secret in namespace | rbac | `namespaces` list |
|
||||
| Cloudflare DNS records | cloudflared | `domains` list |
|
||||
| Woodpecker admin access | woodpecker | user key |
|
||||
|
||||
## Checklist (Manual Flow)
|
||||
|
||||
- [ ] Authentik: user added to `kubernetes-namespace-owners` group
|
||||
- [ ] Authentik: user added to `sops-USERNAME` group (for SOPS state decrypt)
|
||||
- [ ] Authentik: user added to `Headscale Users` group (if VPN needed)
|
||||
- [ ] Vault KV: `k8s_users` entry added to `secret/platform`
|
||||
- [ ] Vault stack applied — namespace + policy + identity + deployer role + SOPS Transit key + SOPS policy + SOPS group created
|
||||
- [ ] RBAC stack applied — RBAC + quota + TLS created
|
||||
- [ ] Woodpecker stack applied — admin list updated
|
||||
- [ ] Verification: namespace, quota, policies (namespace-owner + sops-user), deployer role, Transit key all confirmed
|
||||
- [ ] User notified with onboarding link
|
||||
|
|
@ -1,170 +0,0 @@
|
|||
---
|
||||
name: authentik-oidc-kubernetes
|
||||
description: |
|
||||
Configure Authentik as OIDC provider for Kubernetes API server authentication.
|
||||
Use when: (1) setting up OIDC auth for kubectl with Authentik, (2) kube-apiserver
|
||||
rejects OIDC tokens with "oidc: email not verified", (3) JWKS endpoint returns
|
||||
empty {} despite provider being configured, (4) kubelogin fails with "claim not
|
||||
present" for email, (5) redirect_uri mismatch errors during kubelogin browser auth,
|
||||
(6) kube-apiserver static pod manifest changes don't take effect after restart.
|
||||
Covers all gotchas discovered when integrating Authentik 2025.10.x with Kubernetes
|
||||
1.34.x using kubelogin (int128/kubelogin).
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-17
|
||||
---
|
||||
|
||||
# Authentik OIDC for Kubernetes API Authentication
|
||||
|
||||
## Problem
|
||||
Setting up Authentik as an OIDC identity provider for Kubernetes kubectl access
|
||||
involves multiple non-obvious pitfalls that cause silent failures at different
|
||||
stages of the authentication flow.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Setting up multi-user kubectl access with OIDC
|
||||
- Using Authentik as the identity provider and kubelogin (int128/kubelogin) as the kubectl plugin
|
||||
- Any of these errors:
|
||||
- `oidc: email not verified`
|
||||
- `oidc: parse username claims "email": claim not present`
|
||||
- `The request fails due to a missing, invalid, or mismatching redirection URI`
|
||||
- JWKS endpoint (`/application/o/<app>/jwks/`) returns `{}`
|
||||
- `Unauthorized` after successful browser login
|
||||
|
||||
## Solution
|
||||
|
||||
### Gotcha 1: Signing Key Must Be Assigned
|
||||
|
||||
Authentik's OAuth2 provider does NOT assign a signing key by default. Without it,
|
||||
the JWKS endpoint returns `{}` and kube-apiserver can't validate tokens.
|
||||
|
||||
**Fix:** Assign a signing key (e.g., "authentik Self-signed Certificate") to the
|
||||
OAuth2 provider:
|
||||
```python
|
||||
# Via Django shell (kubectl exec into authentik server pod)
|
||||
from authentik.providers.oauth2.models import OAuth2Provider
|
||||
from authentik.crypto.models import CertificateKeyPair
|
||||
|
||||
provider = OAuth2Provider.objects.get(name='kubernetes')
|
||||
cert = CertificateKeyPair.objects.filter(name='authentik Self-signed Certificate').first()
|
||||
provider.signing_key = cert
|
||||
provider.save()
|
||||
```
|
||||
|
||||
Or via API:
|
||||
```bash
|
||||
curl -X PATCH -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
|
||||
"$AUTHENTIK_URL/api/v3/providers/oauth2/<pk>/" \
|
||||
-d '{"signing_key": "<certificate-keypair-uuid>"}'
|
||||
```
|
||||
|
||||
### Gotcha 2: Default Email Mapping Sets `email_verified: False`
|
||||
|
||||
Authentik's built-in email scope mapping hardcodes `email_verified: False`:
|
||||
```python
|
||||
return {
|
||||
"email": request.user.email,
|
||||
"email_verified": False # <-- This causes kube-apiserver to reject the token
|
||||
}
|
||||
```
|
||||
|
||||
kube-apiserver requires `email_verified: true` by default.
|
||||
|
||||
**Fix:** Create a custom scope mapping with `email_verified: True` and assign it
|
||||
to the provider instead of the default:
|
||||
```python
|
||||
from authentik.providers.oauth2.models import OAuth2Provider, ScopeMapping
|
||||
|
||||
# Create custom mapping
|
||||
mapping, _ = ScopeMapping.objects.get_or_create(
|
||||
name='Kubernetes Email (verified)',
|
||||
defaults={
|
||||
'scope_name': 'email',
|
||||
'expression': 'return {"email": request.user.email, "email_verified": True}'
|
||||
}
|
||||
)
|
||||
|
||||
# Replace default email mapping on the provider
|
||||
provider = OAuth2Provider.objects.get(name='kubernetes')
|
||||
default_email = ScopeMapping.objects.filter(
|
||||
managed='goauthentik.io/providers/oauth2/scope-email'
|
||||
).first()
|
||||
if default_email:
|
||||
provider.property_mappings.remove(default_email)
|
||||
provider.property_mappings.add(mapping)
|
||||
```
|
||||
|
||||
### Gotcha 3: kubelogin Needs Extra Scopes
|
||||
|
||||
By default, kubelogin only requests the `openid` scope. The token will lack
|
||||
`email` and `groups` claims, causing:
|
||||
```
|
||||
oidc: parse username claims "email": claim not present
|
||||
```
|
||||
|
||||
**Fix:** Add `--oidc-extra-scope` flags to the kubeconfig exec plugin:
|
||||
```yaml
|
||||
users:
|
||||
- name: oidc-user
|
||||
user:
|
||||
exec:
|
||||
command: kubectl
|
||||
args:
|
||||
- oidc-login
|
||||
- get-token
|
||||
- --oidc-issuer-url=https://authentik.example.com/application/o/kubernetes/
|
||||
- --oidc-client-id=kubernetes
|
||||
- --oidc-extra-scope=email # Required!
|
||||
- --oidc-extra-scope=profile
|
||||
- --oidc-extra-scope=groups
|
||||
```
|
||||
|
||||
### Gotcha 4: Redirect URIs Must Use Regex Mode
|
||||
|
||||
kubelogin picks a random available port (tries 8000, 18000, then random).
|
||||
Strict redirect URI matching like `http://localhost:8000/callback` will fail
|
||||
when kubelogin uses a different port.
|
||||
|
||||
**Fix:** Use regex matching in the Authentik provider:
|
||||
```json
|
||||
{
|
||||
"redirect_uris": [
|
||||
{"matching_mode": "regex", "url": "http://localhost:.*"},
|
||||
{"matching_mode": "regex", "url": "http://127\\.0\\.0\\.1:.*"}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
### Gotcha 5: Property Mappings API Endpoint Changed
|
||||
|
||||
In Authentik 2025.10.x, scope mappings are at:
|
||||
- `propertymappings/provider/scope/` (new, correct)
|
||||
- NOT `propertymappings/scope/` (old, returns 405 Method Not Allowed on POST)
|
||||
|
||||
### Gotcha 6: Static Pod Manifest Changes Need Full Cycle
|
||||
|
||||
See skill: `kubelet-static-pod-manifest-update` for the full restart procedure.
|
||||
|
||||
## Verification
|
||||
|
||||
After all fixes:
|
||||
```bash
|
||||
# 1. JWKS has a key
|
||||
curl -s https://authentik.example.com/application/o/kubernetes/jwks/ | jq '.keys | length'
|
||||
# Expected: 1 (or more)
|
||||
|
||||
# 2. Test auth
|
||||
KUBECONFIG=/path/to/oidc-kubeconfig kubectl get namespaces
|
||||
# Expected: browser opens, login, namespaces returned
|
||||
|
||||
# 3. Check API server logs for success
|
||||
ssh master "sudo kubectl logs -n kube-system kube-apiserver-* | grep oidc | tail -5"
|
||||
# Expected: no "Unable to authenticate" errors
|
||||
```
|
||||
|
||||
## Notes
|
||||
- The OAuth2 provider should use `client_type: public` (no client secret needed for kubelogin)
|
||||
- Set `sub_mode: user_email` so the OIDC subject matches the RBAC binding
|
||||
- Set `include_claims_in_id_token: true` for the token to contain claims directly
|
||||
- Use `issuer_mode: per_provider` for a clean issuer URL
|
||||
- RBAC ClusterRoleBindings should match on the user's email (the `--oidc-username-claim=email` value)
|
||||
|
|
@ -1,297 +0,0 @@
|
|||
---
|
||||
name: authentik
|
||||
description: |
|
||||
Manage the Authentik identity provider via its REST API. Use when:
|
||||
(1) User asks to create, update, or delete users in Authentik,
|
||||
(2) User asks to manage groups or group memberships,
|
||||
(3) User asks to create a new OAuth2/OIDC application or provider,
|
||||
(4) User asks to protect a service with forward auth (Authentik + Traefik),
|
||||
(5) User asks about SSO, single sign-on, authentication, or identity,
|
||||
(6) User asks to manage Authentik flows, stages, or policies,
|
||||
(7) User asks to configure social login (Google, GitHub, Facebook),
|
||||
(8) User asks about OIDC for Kubernetes or who has access to what,
|
||||
(9) User deploys a new service that needs authentication.
|
||||
Authentik v2025.10.3 running in Kubernetes, managed via REST API.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-17
|
||||
---
|
||||
|
||||
# Authentik Identity Provider Management
|
||||
|
||||
## Overview
|
||||
- **URL**: `https://authentik.viktorbarzin.me`
|
||||
- **Admin UI**: `https://authentik.viktorbarzin.me/if/admin/`
|
||||
- **API Base**: `https://authentik.viktorbarzin.me/api/v3/`
|
||||
- **API Docs**: `https://authentik.viktorbarzin.me/api/v3/docs/`
|
||||
- **Helm Chart**: authentik v2025.10.3
|
||||
- **Namespace**: `authentik`
|
||||
|
||||
## API Access
|
||||
|
||||
### Getting the Token
|
||||
The API token is stored in `terraform.tfvars` (git-crypt encrypted):
|
||||
```bash
|
||||
AUTHENTIK_TOKEN=$(grep authentik_api_token terraform.tfvars | cut -d'"' -f2)
|
||||
```
|
||||
|
||||
### Making API Calls
|
||||
```bash
|
||||
# Generic pattern
|
||||
curl -s -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/<endpoint>/"
|
||||
|
||||
# With JSON body (POST/PATCH/PUT)
|
||||
curl -s -X POST \
|
||||
-H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
-H "Content-Type: application/json" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/<endpoint>/" \
|
||||
-d '{"key": "value"}'
|
||||
```
|
||||
|
||||
### Verify Token Works
|
||||
```bash
|
||||
curl -s -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/core/users/me/" | python3 -m json.tool
|
||||
```
|
||||
|
||||
## Key API Endpoints
|
||||
|
||||
| Endpoint | Methods | Purpose |
|
||||
|----------|---------|---------|
|
||||
| `core/users/` | GET, POST | List/create users |
|
||||
| `core/users/{id}/` | GET, PATCH, DELETE | Get/update/delete user |
|
||||
| `core/groups/` | GET, POST | List/create groups |
|
||||
| `core/groups/{pk}/` | GET, PATCH, DELETE | Get/update/delete group |
|
||||
| `core/applications/` | GET, POST | List/create applications |
|
||||
| `core/tokens/` | GET, POST | List/create tokens |
|
||||
| `core/tokens/{identifier}/view_key/` | GET | View token secret key |
|
||||
| `providers/all/` | GET | List all providers |
|
||||
| `providers/oauth2/` | GET, POST | OAuth2/OIDC providers |
|
||||
| `providers/proxy/` | GET, POST | Proxy providers (forward auth) |
|
||||
| `flows/instances/` | GET | List flows |
|
||||
| `stages/all/` | GET | List stages |
|
||||
| `sources/all/` | GET | List sources (social login) |
|
||||
| `outposts/instances/` | GET | List outposts |
|
||||
| `propertymappings/provider/scope/` | GET, POST | OIDC scope mappings |
|
||||
| `rbac/roles/` | GET | List roles |
|
||||
|
||||
## Common Operations
|
||||
|
||||
### List All Users
|
||||
```bash
|
||||
curl -s -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/core/users/?page_size=50" | \
|
||||
python3 -c "
|
||||
import json,sys
|
||||
for u in json.load(sys.stdin)['results']:
|
||||
groups=[g['name'] for g in u.get('groups_obj',[])]
|
||||
print(f\" {u['username']:<40} {u['name']:<30} groups={groups}\")
|
||||
"
|
||||
```
|
||||
|
||||
### Create a New User
|
||||
```bash
|
||||
curl -s -X POST \
|
||||
-H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
-H "Content-Type: application/json" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/core/users/" \
|
||||
-d '{
|
||||
"username": "user@example.com",
|
||||
"name": "Full Name",
|
||||
"email": "user@example.com",
|
||||
"is_active": true,
|
||||
"type": "internal",
|
||||
"path": "users"
|
||||
}'
|
||||
```
|
||||
|
||||
### Add User to Group
|
||||
```bash
|
||||
# First get the group to find current users
|
||||
GROUP_PK="<group-uuid>"
|
||||
CURRENT_USERS=$(curl -s -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/core/groups/$GROUP_PK/" | \
|
||||
python3 -c "import json,sys; print(json.load(sys.stdin)['users'])")
|
||||
|
||||
# Then PATCH with the updated user list (add new user pk)
|
||||
curl -s -X PATCH \
|
||||
-H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
-H "Content-Type: application/json" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/core/groups/$GROUP_PK/" \
|
||||
-d '{"users": [<existing_pks>, <new_pk>]}'
|
||||
```
|
||||
|
||||
### Create a New Group
|
||||
```bash
|
||||
curl -s -X POST \
|
||||
-H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
-H "Content-Type: application/json" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/core/groups/" \
|
||||
-d '{
|
||||
"name": "My New Group",
|
||||
"is_superuser": false,
|
||||
"parent": "<parent-group-pk-or-null>"
|
||||
}'
|
||||
```
|
||||
|
||||
### Create OAuth2/OIDC Application (Full Flow)
|
||||
|
||||
**Step 1: Create the OAuth2 Provider**
|
||||
```bash
|
||||
curl -s -X POST \
|
||||
-H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
-H "Content-Type: application/json" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/providers/oauth2/" \
|
||||
-d '{
|
||||
"name": "Provider for myapp",
|
||||
"authorization_flow": "<flow-pk>",
|
||||
"invalidation_flow": "<invalidation-flow-pk>",
|
||||
"client_type": "confidential",
|
||||
"client_id": "<generated-or-custom>",
|
||||
"client_secret": "<generated-or-custom>",
|
||||
"redirect_uris": "https://myapp.viktorbarzin.me/callback",
|
||||
"property_mappings": ["<scope-mapping-pks>"],
|
||||
"signing_key": "<signing-key-pk>"
|
||||
}'
|
||||
```
|
||||
|
||||
**Step 2: Create the Application**
|
||||
```bash
|
||||
curl -s -X POST \
|
||||
-H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
-H "Content-Type: application/json" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/core/applications/" \
|
||||
-d '{
|
||||
"name": "My App",
|
||||
"slug": "myapp",
|
||||
"provider": <provider-pk-from-step-1>,
|
||||
"meta_launch_url": "https://myapp.viktorbarzin.me"
|
||||
}'
|
||||
```
|
||||
|
||||
### List Applications
|
||||
```bash
|
||||
curl -s -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/core/applications/?page_size=50" | \
|
||||
python3 -c "
|
||||
import json,sys
|
||||
for a in json.load(sys.stdin)['results']:
|
||||
ptype = a.get('provider_obj',{}).get('verbose_name','N/A')
|
||||
print(f\" {a['name']:<30} slug={a['slug']:<25} provider={ptype}\")
|
||||
"
|
||||
```
|
||||
|
||||
### Create a Non-Expiring API Token
|
||||
```bash
|
||||
# Create token
|
||||
curl -s -X POST \
|
||||
-H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
-H "Content-Type: application/json" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/core/tokens/" \
|
||||
-d '{
|
||||
"identifier": "my-token-name",
|
||||
"intent": "api",
|
||||
"expiring": false,
|
||||
"description": "Description here"
|
||||
}'
|
||||
|
||||
# Retrieve the key
|
||||
curl -s -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/core/tokens/my-token-name/view_key/"
|
||||
```
|
||||
|
||||
## Important Reference UUIDs
|
||||
|
||||
### Authorization Flows
|
||||
| Flow | Slug | Use For |
|
||||
|------|------|---------|
|
||||
| Authorize Application (explicit consent) | `default-provider-authorization-explicit-consent` | Apps that should show consent screen |
|
||||
| Authorize Application (implicit consent) | `default-provider-authorization-implicit-consent` | Internal/trusted apps, auto-redirect |
|
||||
| Logout | `default-invalidation-flow` | Invalidation/logout flow |
|
||||
|
||||
### Common Property Mappings (OIDC Scopes)
|
||||
These are the standard scope mappings used by most providers:
|
||||
- `60e33a8c-66a2-414f-840c-b13012b4d4bd` — openid
|
||||
- `1f51c659-f13b-4ad4-ba89-70458ef88e9c` — email
|
||||
- `4c0bf430-7f74-4216-b9d7-23703ab544ba` — profile
|
||||
|
||||
### Login Sources
|
||||
| Source | Slug | Matching Mode |
|
||||
|--------|------|---------------|
|
||||
| Google | `google` | identifier |
|
||||
| GitHub | `github` | email_link |
|
||||
| Facebook | `facebook` | email_link |
|
||||
|
||||
## Protecting a Service with Forward Auth
|
||||
|
||||
To protect a service via Authentik + Traefik forward auth:
|
||||
|
||||
1. In the service's Terraform module, set `protected = true` in the `ingress_factory` call
|
||||
2. This adds the `authentik-forward-auth` Traefik middleware
|
||||
3. Unauthenticated users get redirected to the Authentik login page
|
||||
4. After login, these headers are forwarded to the service:
|
||||
- `X-authentik-username`
|
||||
- `X-authentik-uid`
|
||||
- `X-authentik-email`
|
||||
- `X-authentik-name`
|
||||
- `X-authentik-groups`
|
||||
|
||||
## Invitation Management
|
||||
|
||||
### Create Invitation
|
||||
```bash
|
||||
curl -s -X POST \
|
||||
-H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
-H "Content-Type: application/json" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/stages/invitation/invitations/" \
|
||||
-d '{
|
||||
"name": "invite-slug-name",
|
||||
"single_use": true,
|
||||
"fixed_data": {"group": "Target Group Name"},
|
||||
"flow": "<invitation-enrollment-flow-pk>"
|
||||
}'
|
||||
# Returns PK which is the itoken
|
||||
# Link: https://authentik.viktorbarzin.me/if/flow/invitation-enrollment/?itoken=<pk>
|
||||
```
|
||||
|
||||
### List Invitations
|
||||
```bash
|
||||
curl -s -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/stages/invitation/invitations/?page_size=50"
|
||||
```
|
||||
|
||||
### Delete Invitation
|
||||
```bash
|
||||
curl -s -X DELETE -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/stages/invitation/invitations/<pk>/"
|
||||
```
|
||||
|
||||
### Helper Script
|
||||
Use `.claude/scripts/authentik-invite.sh` for invitation management:
|
||||
```bash
|
||||
./authentik-invite.sh create "Group Name" [--days N]
|
||||
./authentik-invite.sh assign <username> "Group Name"
|
||||
./authentik-invite.sh list
|
||||
```
|
||||
|
||||
### Important Notes
|
||||
- OAuth source `enrollment_flow` is set to `invitation-enrollment` -- new social login users require invitation
|
||||
- Source updates require Django ORM (PATCH not supported on `sources/oauth/<slug>/`)
|
||||
- Invitation `name` field must be a slug (letters, numbers, hyphens, underscores)
|
||||
|
||||
## Gotchas
|
||||
|
||||
1. **API pagination**: All list endpoints return paginated results. Use `?page_size=50` or check `pagination.next` for more pages.
|
||||
2. **Group user updates**: PATCH to groups replaces the entire user list — always fetch current users first, then append.
|
||||
3. **Provider property mappings**: Must reference existing scope mapping UUIDs. Query `propertymappings/provider/scope/` to find them.
|
||||
4. **Signing key for OIDC**: Must assign a signing key to OAuth2 providers or JWKS endpoint returns empty `{}`.
|
||||
5. **Email verified claim**: Default email scope mapping sets `email_verified: False`. For Kubernetes OIDC, create a custom mapping that returns `True`.
|
||||
6. **Token identifier uniqueness**: Token identifiers must be unique across the entire instance.
|
||||
|
||||
## Notes
|
||||
- Authentik is classified as DEFCON Level 1 (Critical) — handle with care
|
||||
- Changes to Authentik configuration (Helm chart, PgBouncer, etc.) must go through Terraform
|
||||
- API-level changes (users, groups, applications) are fine to make directly via the API
|
||||
- The embedded outpost auto-discovers providers assigned to it
|
||||
- See also: `ingress-factory-migration` skill for protecting services
|
||||
|
|
@ -1,175 +0,0 @@
|
|||
---
|
||||
name: bluestacks-burp-interception
|
||||
description: |
|
||||
Intercept Android app HTTPS traffic using BlueStacks and Burp Suite on macOS.
|
||||
Use when: (1) Need to analyze Android app API calls, (2) App ignores HTTP proxy,
|
||||
(3) App uses SSL pinning that blocks interception, (4) Need to install Burp CA
|
||||
as system certificate. Covers ADB setup, proxy configuration, Zygisk SSL unpinning,
|
||||
and Magisk trustusercerts module for system CA installation.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-01-24
|
||||
---
|
||||
|
||||
# BlueStacks + Burp Suite HTTPS Traffic Interception
|
||||
|
||||
## Problem
|
||||
You want to intercept HTTPS traffic from an Android app running in BlueStacks to analyze
|
||||
API calls, but the app either ignores the proxy or uses SSL certificate pinning.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Running BlueStacks on macOS with Burp Suite
|
||||
- App traffic not appearing in Burp Suite
|
||||
- App crashes or refuses to connect when proxy is set
|
||||
- Need to bypass SSL pinning for security testing/research
|
||||
|
||||
## Prerequisites
|
||||
- BlueStacks with Magisk (kitsune variant) and root enabled
|
||||
- Zygisk-SSL-Unpinning module installed
|
||||
- trustusercerts Magisk module installed
|
||||
- Android SDK installed (for ADB)
|
||||
- Burp Suite running on port 8080
|
||||
|
||||
## Solution
|
||||
|
||||
### Step 1: Connect ADB to BlueStacks
|
||||
|
||||
```bash
|
||||
# ADB location on macOS (Android SDK)
|
||||
ADB=~/Library/Android/sdk/platform-tools/adb
|
||||
|
||||
# Connect to BlueStacks
|
||||
$ADB connect localhost:5555
|
||||
|
||||
# Verify connection
|
||||
$ADB devices
|
||||
# Should show: emulator-5554 or localhost:5555
|
||||
```
|
||||
|
||||
Note: BlueStacks runs **arm64-v8a** (not x86 as you might expect).
|
||||
|
||||
### Step 2: Set HTTP Proxy
|
||||
|
||||
Use your Mac's WiFi IP address (not 10.0.2.2 or localhost):
|
||||
|
||||
```bash
|
||||
# Get Mac WiFi IP
|
||||
IP=$(ipconfig getifaddr en0)
|
||||
|
||||
# Set proxy (Burp default port 8080)
|
||||
$ADB shell settings put global http_proxy ${IP}:8080
|
||||
|
||||
# Verify
|
||||
$ADB shell settings get global http_proxy
|
||||
|
||||
# Disable proxy when done
|
||||
$ADB shell settings put global http_proxy :0
|
||||
```
|
||||
|
||||
### Step 3: Configure SSL Unpinning for Target App
|
||||
|
||||
```bash
|
||||
# Find app package name
|
||||
$ADB shell pm list packages | grep <keyword>
|
||||
|
||||
# Edit config
|
||||
$ADB shell "su -c 'cat > /data/local/tmp/zyg.ssl/config.json << EOF
|
||||
{
|
||||
\"targets\": [
|
||||
{
|
||||
\"pkg_name\" : \"com.example.app\",
|
||||
\"enable\": true,
|
||||
\"start_safe\": true,
|
||||
\"start_delay\": 1000
|
||||
}
|
||||
]
|
||||
}
|
||||
EOF'"
|
||||
|
||||
# Restart the app
|
||||
$ADB shell am force-stop com.example.app
|
||||
$ADB shell monkey -p com.example.app -c android.intent.category.LAUNCHER 1
|
||||
|
||||
# Verify SSL unpinning is active
|
||||
$ADB shell "logcat -d | grep -i ZygiskSSL | tail -10"
|
||||
# Should show: "App detected: com.example.app" and "[*] SSL UNPINNING [#]"
|
||||
```
|
||||
|
||||
### Step 4: Install Burp CA as System Certificate
|
||||
|
||||
```bash
|
||||
# Download Burp CA cert
|
||||
curl -x http://127.0.0.1:8080 http://burp/cert -o /tmp/burp-cert.der
|
||||
|
||||
# Convert to PEM
|
||||
openssl x509 -inform DER -in /tmp/burp-cert.der -out /tmp/burp-cert.pem
|
||||
|
||||
# Get hash for Android cert store naming
|
||||
HASH=$(openssl x509 -inform PEM -subject_hash_old -in /tmp/burp-cert.pem | head -1)
|
||||
cp /tmp/burp-cert.pem /tmp/${HASH}.0
|
||||
|
||||
# Push to device
|
||||
$ADB push /tmp/${HASH}.0 /sdcard/
|
||||
|
||||
# Install via trustusercerts Magisk module
|
||||
$ADB shell "su -c 'cp /sdcard/${HASH}.0 /data/adb/modules/trustusercerts/system/etc/security/cacerts/'"
|
||||
$ADB shell "su -c 'chmod 644 /data/adb/modules/trustusercerts/system/etc/security/cacerts/${HASH}.0'"
|
||||
|
||||
# Reboot required for Magisk overlay
|
||||
$ADB shell "su -c 'reboot'"
|
||||
|
||||
# After reboot, verify cert is in system store
|
||||
$ADB shell "su -c 'ls /system/etc/security/cacerts/${HASH}.0'"
|
||||
```
|
||||
|
||||
### Step 5: Test Interception
|
||||
|
||||
1. Re-enable proxy after reboot: `$ADB shell settings put global http_proxy ${IP}:8080`
|
||||
2. Launch target app
|
||||
3. Check Burp Suite → Proxy → HTTP history for requests
|
||||
|
||||
## Verification
|
||||
|
||||
- Proxy set: `adb shell settings get global http_proxy` returns `<ip>:8080`
|
||||
- SSL unpinning active: `logcat | grep ZygiskSSL` shows "SSL UNPINNING"
|
||||
- Burp CA installed: `ls /system/etc/security/cacerts/<hash>.0` exists
|
||||
- Traffic visible in Burp Suite HTTP history
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
| Symptom | Cause | Fix |
|
||||
|---------|-------|-----|
|
||||
| No traffic in Burp | Proxy not set | Check `settings get global http_proxy` |
|
||||
| App shows SSL error | Cert not installed | Verify cert in system store, reboot |
|
||||
| SSL unpinning not working | Config not loaded | Force-stop app, check config.json syntax |
|
||||
| ADB connection refused | BlueStacks ADB disabled | Enable in BlueStacks Settings → Advanced |
|
||||
| Wrong cert hash | Using wrong openssl flag | Use `subject_hash_old` not `subject_hash` |
|
||||
|
||||
## Notes
|
||||
|
||||
- BlueStacks runs arm64-v8a, so Zygisk modules need arm64 support
|
||||
- The trustusercerts module copies certs at boot via Magisk overlay
|
||||
- System partition is read-only; use Magisk modules instead of direct mounting
|
||||
- Burp cert hash is typically `9a5ba575` but verify for your instance
|
||||
- Some apps may use additional protections (root detection, Frida detection)
|
||||
|
||||
## Quick Reference
|
||||
|
||||
```bash
|
||||
# Set proxy
|
||||
adb shell settings put global http_proxy <ip>:8080
|
||||
|
||||
# Disable proxy
|
||||
adb shell settings put global http_proxy :0
|
||||
|
||||
# Check SSL unpinning logs
|
||||
adb shell "logcat -d | grep -i ZygiskSSL"
|
||||
|
||||
# Force restart app
|
||||
adb shell am force-stop <package> && adb shell monkey -p <package> -c android.intent.category.LAUNCHER 1
|
||||
```
|
||||
|
||||
## References
|
||||
- [Zygisk-SSL-Unpinning](https://github.com/m0szy/Zygisk-SSL-Unpinning)
|
||||
- [MagiskTrustUserCerts](https://github.com/NVISOsecurity/MagiskTrustUserCerts)
|
||||
- [Burp Suite Documentation](https://portswigger.net/burp/documentation)
|
||||
|
|
@ -1,189 +0,0 @@
|
|||
---
|
||||
name: clickhouse-k8s-nfs-system-log-bloat
|
||||
description: |
|
||||
Fix for ClickHouse consuming excessive CPU (500m-1000m+) on Kubernetes when running on
|
||||
NFS storage, caused by unbounded system log table growth triggering continuous background
|
||||
merges. Use when: (1) ClickHouse burns ~1 CPU core with no active user queries,
|
||||
(2) system.merges shows constant merge activity on system.metric_log or system.trace_log,
|
||||
(3) system log tables (metric_log, trace_log, text_log, asynchronous_metric_log) have
|
||||
grown to gigabytes while actual user data is tiny, (4) ClickHouse crashes with exit code
|
||||
76 (loadOutdatedDataParts SIGSEGV), (5) attempting to mount custom config.d XML via
|
||||
Kubernetes ConfigMap causes exit code 36 (BAD_ARGUMENTS) crashes. Also covers why
|
||||
ClickHouse's MergeTree engine performs poorly on NFS and the CronJob workaround for
|
||||
system log truncation.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-03-01
|
||||
---
|
||||
|
||||
# ClickHouse on Kubernetes/NFS: System Log Bloat & CPU Overhead
|
||||
|
||||
## Problem
|
||||
|
||||
ClickHouse deployed on Kubernetes with NFS storage consumes ~1 CPU core continuously,
|
||||
even when actual user queries are negligible. The CPU is consumed by background merge
|
||||
operations on system log tables that grow unboundedly with no default TTL.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
|
||||
- ClickHouse pod using 500m-1000m+ CPU with no active user queries
|
||||
- `SELECT * FROM system.processes` shows only diagnostic queries
|
||||
- `SELECT * FROM system.merges` shows constant merge activity on `system.metric_log`
|
||||
- System log tables have grown to gigabytes:
|
||||
- `system.trace_log`: 5+ GiB, 200M+ rows
|
||||
- `system.text_log`: 3+ GiB, 90M+ rows
|
||||
- `system.metric_log`: 1+ GiB with 80-100+ active parts (healthy is <20)
|
||||
- `system.asynchronous_metric_log`: 500+ MiB, 1B+ rows
|
||||
- Actual user data (e.g., `clickhouse.events`) is only kilobytes
|
||||
- ClickHouse crashes periodically with exit code 76 (`loadOutdatedDataParts` SIGSEGV)
|
||||
- Data directory is on NFS (e.g., `/mnt/main/clickhouse`)
|
||||
|
||||
## Root Cause
|
||||
|
||||
Two compounding issues:
|
||||
|
||||
1. **No TTL on system log tables**: ClickHouse system tables (`metric_log`, `trace_log`,
|
||||
`text_log`, `asynchronous_metric_log`, `query_log`, `part_log`) have no default
|
||||
retention policy and grow indefinitely.
|
||||
|
||||
2. **NFS amplifies merge overhead**: ClickHouse's MergeTree engine relies on background
|
||||
merge operations that involve heavy sequential I/O. NFS latency makes merges 10-100x
|
||||
slower than local disk, creating a feedback loop:
|
||||
- Slow merges → parts accumulate faster than they can be merged
|
||||
- More parts → more merge operations spawned
|
||||
- More merges → more CPU for decompression/recompression while waiting on NFS I/O
|
||||
|
||||
## Solution
|
||||
|
||||
### Immediate Fix: Truncate System Tables
|
||||
|
||||
```bash
|
||||
CH_POD=$(kubectl get pod -n <namespace> -l app=clickhouse -o jsonpath='{.items[0].metadata.name}')
|
||||
kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.metric_log"
|
||||
kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.trace_log"
|
||||
kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.text_log"
|
||||
kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.asynchronous_metric_log"
|
||||
kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.query_log"
|
||||
kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.part_log"
|
||||
```
|
||||
|
||||
This can take 30-60+ seconds per table on NFS due to part cleanup I/O.
|
||||
|
||||
### Permanent Fix: CronJob for Periodic Truncation
|
||||
|
||||
Add a Kubernetes CronJob that truncates system tables via the ClickHouse HTTP API:
|
||||
|
||||
```hcl
|
||||
resource "kubernetes_cron_job_v1" "clickhouse_truncate_logs" {
|
||||
metadata {
|
||||
name = "clickhouse-truncate-logs"
|
||||
namespace = "<namespace>"
|
||||
}
|
||||
spec {
|
||||
schedule = "0 */6 * * *"
|
||||
successful_jobs_history_limit = 1
|
||||
failed_jobs_history_limit = 1
|
||||
job_template {
|
||||
metadata {}
|
||||
spec {
|
||||
template {
|
||||
metadata {}
|
||||
spec {
|
||||
restart_policy = "OnFailure"
|
||||
container {
|
||||
name = "truncate"
|
||||
image = "curlimages/curl:8.12.1"
|
||||
command = ["sh", "-c", join(" && ", [
|
||||
"curl -s 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.metric_log'",
|
||||
"curl -s 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.trace_log'",
|
||||
"curl -s 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.text_log'",
|
||||
"curl -s 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.asynchronous_metric_log'",
|
||||
"curl -s 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.query_log'",
|
||||
"curl -s 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.part_log'",
|
||||
"echo 'System logs truncated'"
|
||||
])]
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### What Does NOT Work: Config.d XML Mount
|
||||
|
||||
**DO NOT** attempt to mount custom XML config files into `/etc/clickhouse-server/config.d/`
|
||||
via Kubernetes ConfigMap. Both approaches crash ClickHouse with exit code 36 (BAD_ARGUMENTS):
|
||||
|
||||
- **Full directory mount** (`mount_path = "/etc/clickhouse-server/config.d"`): Replaces
|
||||
the entire directory, deleting the built-in `docker_related_config.xml` that the
|
||||
entrypoint expects. Even if you include it in your ConfigMap, ClickHouse still crashes.
|
||||
|
||||
- **sub_path mount** (`sub_path = "custom.xml"`): Also crashes with exit code 36, even
|
||||
with minimal valid XML containing only `<background_pool_size>4</background_pool_size>`.
|
||||
|
||||
- Both `remove="1"` (to disable tables) and `<ttl>` (to set retention) config overrides
|
||||
crash with exit code 36.
|
||||
|
||||
This appears to be an issue with the `clickhouse/clickhouse-server:25.4.2` Docker image
|
||||
and how it preprocesses config at startup. The CronJob approach bypasses this entirely.
|
||||
|
||||
## Verification
|
||||
|
||||
After truncation, verify:
|
||||
|
||||
```bash
|
||||
# CPU should drop from ~900m to ~100m within minutes
|
||||
kubectl top pod -n <namespace> -l app=clickhouse
|
||||
|
||||
# No active merges
|
||||
kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query \
|
||||
"SELECT count() FROM system.merges"
|
||||
|
||||
# System tables should be small
|
||||
kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query \
|
||||
"SELECT database, table, formatReadableSize(sum(bytes_on_disk)) as size, sum(rows) as rows \
|
||||
FROM system.parts WHERE active GROUP BY database, table ORDER BY sum(bytes_on_disk) DESC \
|
||||
FORMAT Pretty"
|
||||
```
|
||||
|
||||
## Diagnostic Commands
|
||||
|
||||
```bash
|
||||
# Check what's consuming CPU (merges vs queries)
|
||||
kubectl exec -n <ns> $CH_POD -- clickhouse-client --query \
|
||||
"SELECT * FROM system.merges FORMAT Pretty"
|
||||
|
||||
kubectl exec -n <ns> $CH_POD -- clickhouse-client --query \
|
||||
"SELECT query_id, elapsed, query FROM system.processes WHERE is_initial_query FORMAT Pretty"
|
||||
|
||||
# Check background pool config
|
||||
kubectl exec -n <ns> $CH_POD -- clickhouse-client --query \
|
||||
"SELECT name, value FROM system.server_settings \
|
||||
WHERE name IN ('background_pool_size', 'background_merges_mutations_concurrency_ratio') \
|
||||
FORMAT Pretty"
|
||||
|
||||
# Default is background_pool_size=16, concurrency_ratio=2 → up to 32 concurrent merges
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
- **Exit code 76**: ClickHouse crashes in `loadOutdatedDataParts()` when there are hundreds
|
||||
of outdated parts on NFS. The truncation CronJob prevents this by keeping tables small.
|
||||
|
||||
- **Exit code 36**: `BAD_ARGUMENTS` in ClickHouse. Triggered by config.d XML mounts in
|
||||
Kubernetes. Root cause unclear but reproducible across mount methods.
|
||||
|
||||
- **Default thread pools**: ClickHouse defaults to `background_pool_size=16` and
|
||||
`background_schedule_pool_size=512`, spawning 700+ threads even for a single-table
|
||||
workload. This overhead is unavoidable without config file changes.
|
||||
|
||||
- **NFS is fundamentally unsuitable** for ClickHouse's MergeTree engine. If data
|
||||
persistence is not critical (e.g., analytics data is small), consider `emptyDir` or
|
||||
local PV storage instead.
|
||||
|
||||
## See Also
|
||||
|
||||
- `k8s-nfs-mount-troubleshooting` — NFS mount failures and permission issues
|
||||
- `k8s-limitrange-oom-silent-kill` — LimitRange defaults causing OOM in ClickHouse containers
|
||||
|
|
@ -1,145 +0,0 @@
|
|||
---
|
||||
name: coturn-k8s-without-hostnetwork
|
||||
description: |
|
||||
Deploy coturn (TURN/STUN server) on Kubernetes without hostNetwork by using a
|
||||
narrow relay port range and MetalLB LoadBalancer service. Use when: (1) deploying
|
||||
a WebRTC relay server on k8s, (2) want coturn to run on any node (not pinned),
|
||||
(3) avoiding hostNetwork for better pod scheduling and multi-replica support,
|
||||
(4) need TURN for NAT traversal in WebRTC apps (video streaming, conferencing).
|
||||
Covers relay port range sizing, MetalLB IP sharing, ephemeral TURN credentials
|
||||
via HMAC-SHA1, and pfSense port forwarding.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-21
|
||||
---
|
||||
|
||||
# coturn on Kubernetes Without hostNetwork
|
||||
|
||||
## Problem
|
||||
TURN servers traditionally require hostNetwork because they relay media over a wide
|
||||
UDP port range (49152-65535). This pins the server to a single node, prevents rolling
|
||||
updates, and wastes cluster flexibility.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Deploying a TURN/STUN server for WebRTC applications on Kubernetes
|
||||
- Want the TURN pod to be schedulable on any node
|
||||
- Need to avoid hostNetwork for better availability and scheduling
|
||||
|
||||
## Solution
|
||||
|
||||
### Key insight: Narrow the relay port range
|
||||
A home lab with ~20 concurrent WebRTC viewers needs ~40 relay ports (2 per viewer).
|
||||
Use 100 ports (49152-49252) instead of 16K. This makes it practical to expose via
|
||||
a K8s LoadBalancer service.
|
||||
|
||||
### Terraform module structure
|
||||
|
||||
```hcl
|
||||
locals {
|
||||
turn_port = 3478
|
||||
min_port = 49152
|
||||
max_port = 49252 # 100 ports — enough for ~50 concurrent streams
|
||||
}
|
||||
|
||||
resource "kubernetes_deployment" "coturn" {
|
||||
spec {
|
||||
# No hostNetwork, no nodeSelector — runs anywhere
|
||||
template {
|
||||
spec {
|
||||
container {
|
||||
image = "coturn/coturn:latest"
|
||||
args = ["-c", "/etc/turnserver/turnserver.conf"]
|
||||
port {
|
||||
container_port = 3478
|
||||
protocol = "UDP"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
resource "kubernetes_service" "coturn" {
|
||||
metadata {
|
||||
annotations = {
|
||||
# Share an existing MetalLB IP to avoid consuming a new one
|
||||
"metallb.universe.tf/loadBalancerIPs" = "10.0.20.200"
|
||||
"metallb.universe.tf/allow-shared-ip" = "shared"
|
||||
}
|
||||
}
|
||||
spec {
|
||||
type = "LoadBalancer"
|
||||
# Signaling port
|
||||
port {
|
||||
name = "turn-udp"
|
||||
port = 3478
|
||||
protocol = "UDP"
|
||||
}
|
||||
# Relay ports — dynamic block generates 100 port definitions
|
||||
dynamic "port" {
|
||||
for_each = range(49152, 49253)
|
||||
content {
|
||||
name = "relay-${port.value}"
|
||||
port = port.value
|
||||
target_port = port.value
|
||||
protocol = "UDP"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### coturn config (turnserver.conf)
|
||||
|
||||
```
|
||||
listening-port=3478
|
||||
fingerprint
|
||||
lt-cred-mech
|
||||
use-auth-secret
|
||||
static-auth-secret=YOUR_SECRET_HERE
|
||||
realm=yourdomain.com
|
||||
listening-ip=0.0.0.0
|
||||
min-port=49152
|
||||
max-port=49252
|
||||
no-multicast-peers
|
||||
no-cli
|
||||
```
|
||||
|
||||
### MetalLB IP sharing
|
||||
To reuse an existing MetalLB IP (e.g., the WireGuard/Shadowsocks shared IP):
|
||||
1. Add `metallb.universe.tf/allow-shared-ip: shared` to the coturn service
|
||||
2. The same annotation must exist on all other services sharing that IP
|
||||
3. **Port conflicts are not allowed** — verify no other service uses 3478 or 49152-49252
|
||||
4. After changing the IP annotation, **delete and recreate** the service — MetalLB won't reassign IPs on annotation changes alone
|
||||
|
||||
### Ephemeral TURN credentials
|
||||
coturn's `use-auth-secret` mode generates time-limited credentials via HMAC-SHA1:
|
||||
|
||||
```javascript
|
||||
const crypto = require('crypto');
|
||||
const TURN_SECRET = 'your-shared-secret';
|
||||
|
||||
function getTurnCredentials(name = 'user', ttl = 86400) {
|
||||
const timestamp = Math.floor(Date.now() / 1000) + ttl;
|
||||
const username = `${timestamp}:${name}`;
|
||||
const credential = crypto.createHmac('sha1', TURN_SECRET)
|
||||
.update(username).digest('base64');
|
||||
return { username, credential };
|
||||
}
|
||||
```
|
||||
|
||||
## Verification
|
||||
|
||||
```bash
|
||||
# STUN binding request (raw UDP probe)
|
||||
echo -ne '\x00\x01\x00\x00\x21\x12\xa4\x42\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00' \
|
||||
| nc -u -w2 <METALLB_IP> 3478 | xxd | head -3
|
||||
# Response starting with 0101 = successful STUN binding response
|
||||
```
|
||||
|
||||
## Notes
|
||||
- 100 relay ports supports ~50 concurrent streams (2 ports per stream)
|
||||
- If you need more, increase `max_port` and add more ports to the service
|
||||
- coturn auto-detects pod IP — no need to set `relay-ip` or `external-ip` explicitly
|
||||
- For public access, add NAT port forwards on pfSense for UDP 3478 + 49152-49252
|
||||
- See also: `pfsense-nat-rule-creation` skill for adding the port forwards
|
||||
|
|
@ -1,99 +0,0 @@
|
|||
---
|
||||
name: crowdsec-agent-registration-failure
|
||||
description: |
|
||||
Fix CrowdSec agent pods stuck in CrashLoopBackOff after LAPI restart due to stale
|
||||
machine registrations. Use when: (1) CrowdSec agent init container fails with
|
||||
"user already exist" error during cscli lapi register, (2) agent pods show hundreds
|
||||
of init container restarts, (3) LAPI was restarted or redeployed but agents kept
|
||||
running with old credentials, (4) cscli machines list shows stale entries for
|
||||
current agent pod names. Covers deleting stale registrations to allow re-registration.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-15
|
||||
---
|
||||
|
||||
# CrowdSec Agent Registration Failure
|
||||
|
||||
## Problem
|
||||
After a CrowdSec LAPI restart or redeployment, agent DaemonSet pods lose their
|
||||
credentials but LAPI retains the old machine registrations. When agents try to
|
||||
re-register with the same pod name, the `wait-for-lapi-and-register` init container
|
||||
fails with `user already exist`, causing CrashLoopBackOff with hundreds of restarts.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Agent init container logs show: `Error: cscli lapi register: api client register: api register ... user 'crowdsec-agent-xxxxx': user already exist`
|
||||
- Agent pods show status `CrashLoopBackOff` or `Init:CrashLoopBackOff` with many restarts
|
||||
- `kubectl describe pod` shows `BackOff restarting failed container wait-for-lapi-and-register`
|
||||
- LAPI pods were recently restarted or redeployed
|
||||
- `cscli machines list` on LAPI shows entries matching the stuck agent pod names
|
||||
|
||||
## Solution
|
||||
|
||||
### Step 1: Identify stuck agents
|
||||
```bash
|
||||
kubectl --kubeconfig $(pwd)/config get pods -n crowdsec
|
||||
```
|
||||
Note the pod names that are in CrashLoopBackOff (e.g., `crowdsec-agent-jr5q7`).
|
||||
|
||||
### Step 2: Confirm the init container error
|
||||
```bash
|
||||
kubectl --kubeconfig $(pwd)/config logs -n crowdsec <agent-pod> -c wait-for-lapi-and-register --tail=5
|
||||
```
|
||||
Should show `user already exist` error.
|
||||
|
||||
### Step 3: Find a running LAPI pod
|
||||
```bash
|
||||
kubectl --kubeconfig $(pwd)/config get pods -n crowdsec | grep lapi
|
||||
```
|
||||
|
||||
### Step 4: Delete stale machine registrations from LAPI
|
||||
```bash
|
||||
kubectl --kubeconfig $(pwd)/config exec -n crowdsec <lapi-pod> -- cscli machines delete <agent-pod-name>
|
||||
```
|
||||
Repeat for each stuck agent.
|
||||
|
||||
### Step 5: Wait for agents to recover
|
||||
The agents are in CrashLoopBackOff with exponential backoff (up to 5 minutes). They'll
|
||||
automatically retry registration and succeed after the stale entry is deleted. This can
|
||||
take up to 5 minutes per agent depending on where they are in the backoff cycle.
|
||||
|
||||
## Verification
|
||||
```bash
|
||||
# All agents should show Running status
|
||||
kubectl --kubeconfig $(pwd)/config get pods -n crowdsec | grep agent
|
||||
# DaemonSet should show all pods READY
|
||||
kubectl --kubeconfig $(pwd)/config get ds -n crowdsec
|
||||
```
|
||||
|
||||
## Example
|
||||
```bash
|
||||
# Identify stuck agents
|
||||
$ kubectl get pods -n crowdsec | grep agent
|
||||
crowdsec-agent-jr5q7 0/1 CrashLoopBackOff 485 3d
|
||||
crowdsec-agent-jw76q 1/1 Running 8 3d
|
||||
crowdsec-agent-mtgxh 0/1 CrashLoopBackOff 483 3d
|
||||
crowdsec-agent-pfw2l 0/1 CrashLoopBackOff 481 3d
|
||||
|
||||
# Delete stale registrations
|
||||
$ kubectl exec -n crowdsec crowdsec-lapi-xxx -- cscli machines delete crowdsec-agent-jr5q7
|
||||
level=info msg="machine 'crowdsec-agent-jr5q7' deleted successfully"
|
||||
$ kubectl exec -n crowdsec crowdsec-lapi-xxx -- cscli machines delete crowdsec-agent-mtgxh
|
||||
$ kubectl exec -n crowdsec crowdsec-lapi-xxx -- cscli machines delete crowdsec-agent-pfw2l
|
||||
|
||||
# Wait ~5 minutes, then verify
|
||||
$ kubectl get pods -n crowdsec | grep agent
|
||||
crowdsec-agent-jr5q7 1/1 Running 1 3d
|
||||
crowdsec-agent-jw76q 1/1 Running 8 3d
|
||||
crowdsec-agent-mtgxh 1/1 Running 1 3d
|
||||
crowdsec-agent-pfw2l 1/1 Running 1 3d
|
||||
```
|
||||
|
||||
## Notes
|
||||
- This is a known limitation of the CrowdSec Helm chart — the init container registration
|
||||
script is not idempotent (it doesn't handle "already exists" by deleting and re-registering).
|
||||
- The `cscli machines list` output will show many historical stale entries from past
|
||||
DaemonSet rollouts. These are harmless but can be cleaned up if desired.
|
||||
- This issue also causes the CrowdSec blocklist import CronJob to fail, since it selects
|
||||
agent pods alphabetically and may pick a non-running one. Fixing the agents also fixes
|
||||
the blocklist import.
|
||||
- See also: `k8s-nfs-mount-troubleshooting` for other common pod startup failures.
|
||||
|
|
@ -1,310 +0,0 @@
|
|||
---
|
||||
name: fastapi-svelte-gpu-webui
|
||||
description: |
|
||||
Pattern for building web UIs for GPU-based CLI tools. Use when:
|
||||
(1) Wrapping a command-line tool with a web interface, (2) Building job queue
|
||||
systems for long-running GPU tasks, (3) Creating file upload/download workflows,
|
||||
(4) Need real-time progress updates via WebSocket, (5) Deploying to Kubernetes
|
||||
with GPU scheduling. Covers FastAPI backend, Svelte 5 frontend, NFS storage,
|
||||
and Terraform deployment.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2025-01-31
|
||||
---
|
||||
|
||||
# FastAPI + Svelte GPU WebUI Pattern
|
||||
|
||||
## Problem
|
||||
Many powerful tools are command-line only, making them inaccessible to non-technical
|
||||
users. Building a web UI requires handling file uploads, job queuing, progress tracking,
|
||||
and GPU resource scheduling.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- You have a CLI tool that does heavy processing (ML inference, media conversion, etc.)
|
||||
- Want to add a web interface for easier access
|
||||
- Need to track long-running job progress
|
||||
- Deploying to Kubernetes with GPU nodes
|
||||
- Files need to persist across pod restarts (NFS storage)
|
||||
|
||||
## Solution Overview
|
||||
|
||||
### Directory Structure
|
||||
```
|
||||
project-web/
|
||||
├── backend/
|
||||
│ ├── main.py # FastAPI app
|
||||
│ ├── api/
|
||||
│ │ ├── __init__.py
|
||||
│ │ └── routes.py # REST endpoints
|
||||
│ ├── services/
|
||||
│ │ ├── __init__.py
|
||||
│ │ └── converter.py # CLI wrapper + job manager
|
||||
│ ├── models/
|
||||
│ │ ├── __init__.py
|
||||
│ │ └── schemas.py # Pydantic models
|
||||
│ └── requirements.txt
|
||||
├── frontend/
|
||||
│ ├── src/
|
||||
│ │ ├── App.svelte
|
||||
│ │ ├── lib/
|
||||
│ │ │ ├── FileUpload.svelte
|
||||
│ │ │ ├── JobsList.svelte
|
||||
│ │ │ └── ProgressBar.svelte
|
||||
│ │ └── stores/
|
||||
│ │ └── jobs.js
|
||||
│ ├── package.json
|
||||
│ └── vite.config.js
|
||||
├── Dockerfile
|
||||
└── README.md
|
||||
```
|
||||
|
||||
### Backend: Job Manager Pattern
|
||||
```python
|
||||
# services/converter.py
|
||||
import asyncio
|
||||
import uuid
|
||||
from datetime import datetime
|
||||
from pathlib import Path
|
||||
from typing import Optional, Callable
|
||||
import subprocess
|
||||
|
||||
class Job:
|
||||
id: str
|
||||
filename: str
|
||||
status: str # pending, processing, completed, failed
|
||||
progress: float
|
||||
created_at: datetime
|
||||
output_file: Optional[str]
|
||||
error: Optional[str]
|
||||
|
||||
class JobManager:
|
||||
def __init__(self, storage_path: str = "/mnt"):
|
||||
self.storage_path = Path(storage_path)
|
||||
self.jobs: dict[str, Job] = {}
|
||||
self.progress_callbacks: dict[str, list[Callable]] = {}
|
||||
|
||||
def create_job(self, filename: str, **options) -> Job:
|
||||
job_id = str(uuid.uuid4())
|
||||
job = Job(
|
||||
id=job_id,
|
||||
filename=filename,
|
||||
status="pending",
|
||||
progress=0.0,
|
||||
created_at=datetime.now(),
|
||||
**options
|
||||
)
|
||||
self.jobs[job_id] = job
|
||||
return job
|
||||
|
||||
async def run_conversion(self, job_id: str):
|
||||
job = self.jobs[job_id]
|
||||
job.status = "processing"
|
||||
|
||||
input_path = self.storage_path / "uploads" / job.filename
|
||||
output_dir = self.storage_path / "outputs" / job_id
|
||||
output_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
# Build command for CLI tool
|
||||
cmd = [
|
||||
"/path/to/cli-tool",
|
||||
str(input_path),
|
||||
"-o", str(output_dir),
|
||||
# Add other options...
|
||||
]
|
||||
|
||||
# Run with output capture for progress parsing
|
||||
process = await asyncio.create_subprocess_exec(
|
||||
*cmd,
|
||||
stdout=asyncio.subprocess.PIPE,
|
||||
stderr=asyncio.subprocess.PIPE,
|
||||
)
|
||||
|
||||
# Parse output for progress updates
|
||||
async def read_output(stream):
|
||||
while True:
|
||||
line = await stream.readline()
|
||||
if not line:
|
||||
break
|
||||
line_str = line.decode().strip()
|
||||
# Parse progress from CLI output
|
||||
if "%" in line_str:
|
||||
# Extract and update progress
|
||||
self.update_progress(job_id, parsed_progress)
|
||||
|
||||
await asyncio.gather(
|
||||
read_output(process.stdout),
|
||||
read_output(process.stderr)
|
||||
)
|
||||
|
||||
returncode = await process.wait()
|
||||
|
||||
if returncode == 0:
|
||||
output_files = list(output_dir.glob("*.m4b"))
|
||||
if output_files:
|
||||
job.output_file = output_files[0].name
|
||||
job.status = "completed"
|
||||
else:
|
||||
job.status = "failed"
|
||||
job.error = f"Exit code {returncode}"
|
||||
|
||||
job_manager = JobManager()
|
||||
```
|
||||
|
||||
### Backend: API Routes
|
||||
```python
|
||||
# api/routes.py
|
||||
from fastapi import APIRouter, UploadFile, File, HTTPException
|
||||
from fastapi.responses import FileResponse
|
||||
from pathlib import Path
|
||||
import shutil
|
||||
import asyncio
|
||||
|
||||
router = APIRouter(prefix="/api")
|
||||
|
||||
@router.post("/upload")
|
||||
async def upload_file(file: UploadFile = File(...)):
|
||||
upload_dir = Path("/mnt/uploads")
|
||||
upload_dir.mkdir(parents=True, exist_ok=True)
|
||||
file_path = upload_dir / file.filename
|
||||
|
||||
with file_path.open("wb") as buffer:
|
||||
shutil.copyfileobj(file.file, buffer)
|
||||
|
||||
return {"filename": file.filename, "size": file_path.stat().st_size}
|
||||
|
||||
@router.post("/jobs")
|
||||
async def create_job(request: JobCreate):
|
||||
job = job_manager.create_job(filename=request.filename, ...)
|
||||
asyncio.create_task(job_manager.run_conversion(job.id))
|
||||
return job
|
||||
|
||||
@router.get("/jobs")
|
||||
async def list_jobs():
|
||||
return job_manager.get_all_jobs()
|
||||
|
||||
@router.get("/jobs/{job_id}/download")
|
||||
async def download_job(job_id: str):
|
||||
job = job_manager.get_job(job_id)
|
||||
if not job or job.status != "completed":
|
||||
raise HTTPException(404)
|
||||
output_path = Path("/mnt/outputs") / job_id / job.output_file
|
||||
return FileResponse(output_path, filename=job.output_file)
|
||||
```
|
||||
|
||||
### Frontend: Svelte 5 Components
|
||||
```svelte
|
||||
<!-- FileUpload.svelte -->
|
||||
<script>
|
||||
let { onUpload } = $props();
|
||||
let dragOver = $state(false);
|
||||
let uploading = $state(false);
|
||||
|
||||
async function handleUpload(file) {
|
||||
uploading = true;
|
||||
const formData = new FormData();
|
||||
formData.append('file', file);
|
||||
|
||||
const response = await fetch('/api/upload', {
|
||||
method: 'POST',
|
||||
body: formData
|
||||
});
|
||||
|
||||
if (response.ok) {
|
||||
const data = await response.json();
|
||||
onUpload(data.filename);
|
||||
}
|
||||
uploading = false;
|
||||
}
|
||||
</script>
|
||||
|
||||
<div class="dropzone"
|
||||
class:dragover={dragOver}
|
||||
ondragover={(e) => { e.preventDefault(); dragOver = true; }}
|
||||
ondragleave={() => dragOver = false}
|
||||
ondrop={(e) => { e.preventDefault(); handleUpload(e.dataTransfer.files[0]); }}>
|
||||
Drop file here
|
||||
</div>
|
||||
```
|
||||
|
||||
### Dockerfile
|
||||
```dockerfile
|
||||
FROM python:3.12-slim
|
||||
|
||||
# Install Node for frontend build
|
||||
RUN apt-get update && apt-get install -y nodejs npm
|
||||
|
||||
# Build frontend
|
||||
COPY frontend/ /app/frontend/
|
||||
WORKDIR /app/frontend
|
||||
RUN npm install && npm run build
|
||||
|
||||
# Install backend
|
||||
COPY backend/ /app/backend/
|
||||
WORKDIR /app/backend
|
||||
RUN pip install -r requirements.txt
|
||||
|
||||
# Serve static files from FastAPI
|
||||
EXPOSE 8000
|
||||
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
|
||||
```
|
||||
|
||||
### Terraform Deployment (GPU)
|
||||
```hcl
|
||||
resource "kubernetes_deployment" "myapp" {
|
||||
spec {
|
||||
template {
|
||||
spec {
|
||||
node_selector = { "gpu" : "true" }
|
||||
|
||||
toleration {
|
||||
key = "nvidia.com/gpu"
|
||||
operator = "Equal"
|
||||
value = "true"
|
||||
effect = "NoSchedule"
|
||||
}
|
||||
|
||||
container {
|
||||
image = "myregistry/myapp@sha256:..."
|
||||
name = "myapp"
|
||||
|
||||
resources {
|
||||
limits = { "nvidia.com/gpu" = "1" }
|
||||
}
|
||||
|
||||
volume_mount {
|
||||
name = "data"
|
||||
mount_path = "/mnt"
|
||||
}
|
||||
}
|
||||
|
||||
volume {
|
||||
name = "data"
|
||||
nfs {
|
||||
server = "10.0.10.15"
|
||||
path = "/mnt/main/myapp"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Verification
|
||||
1. Upload a file via the UI
|
||||
2. Start a conversion job
|
||||
3. Watch progress update in real-time
|
||||
4. Download the completed file
|
||||
5. Verify files persist across pod restarts
|
||||
|
||||
## Notes
|
||||
- Use image digest for reliable deployments (see `k8s-docker-registry-cache-bypass` skill)
|
||||
- NFS storage persists across pod restarts
|
||||
- GPU node taints require matching tolerations
|
||||
- Consider adding job persistence (database) for production use
|
||||
- WebSocket can provide smoother progress updates than polling
|
||||
|
||||
## See Also
|
||||
- `k8s-docker-registry-cache-bypass` - Fixing image cache issues
|
||||
- `k8s-gpu-no-nvidia-devices` - GPU device troubleshooting
|
||||
- `python-filename-sanitization` - Secure file handling
|
||||
|
|
@ -1,105 +0,0 @@
|
|||
---
|
||||
name: grafana-stale-datasource-cleanup
|
||||
description: |
|
||||
Fix Grafana datasource errors when a Helm chart creates a datasource that conflicts
|
||||
with provisioned ones, or when stale datasources persist in the MySQL database.
|
||||
Use when: (1) Grafana shows "dial tcp: lookup <service> no such host" for a datasource,
|
||||
(2) Grafana API returns "datasources:delete permissions needed" when trying to remove
|
||||
a datasource, (3) provisioned datasource exists but Grafana uses a stale one from
|
||||
the database, (4) Helm chart auto-creates a datasource pointing to a disabled gateway
|
||||
service (e.g., loki-gateway). Requires direct MySQL access to fix when Grafana RBAC
|
||||
blocks API operations.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-13
|
||||
---
|
||||
|
||||
# Grafana Stale Datasource Cleanup
|
||||
|
||||
## Problem
|
||||
Grafana uses a stale or incorrect datasource from its MySQL database instead of
|
||||
the correctly provisioned one. Common when Helm charts auto-create datasources
|
||||
that point to services you've disabled (e.g., Loki gateway).
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Grafana shows error: `dial tcp: lookup loki-gateway on 10.96.0.10:53: no such host`
|
||||
- A provisioned datasource (via ConfigMap sidecar) is correct but Grafana uses a
|
||||
different one stored in MySQL
|
||||
- Grafana API returns `"permissions needed: datasources:delete"` or
|
||||
`"permissions needed: datasources:write"` even with admin credentials
|
||||
- Dashboard references a datasource UID that points to a wrong URL
|
||||
|
||||
## Solution
|
||||
|
||||
### Step 1: Identify the stale datasource
|
||||
|
||||
List all datasources via API (this usually works even with RBAC):
|
||||
```bash
|
||||
kubectl exec -n monitoring deploy/grafana -c grafana -- \
|
||||
sh -c 'curl -s "http://localhost:3000/api/datasources" \
|
||||
-u "admin:$GF_SECURITY_ADMIN_PASSWORD"' | python3 -c \
|
||||
"import sys,json; [print(d['uid'], d['name'], d['url']) for d in json.load(sys.stdin)]"
|
||||
```
|
||||
|
||||
### Step 2: Try API deletion first
|
||||
|
||||
```bash
|
||||
kubectl exec -n monitoring deploy/grafana -c grafana -- \
|
||||
sh -c 'curl -s -X DELETE "http://localhost:3000/api/datasources/uid/<STALE_UID>" \
|
||||
-u "admin:$GF_SECURITY_ADMIN_PASSWORD"'
|
||||
```
|
||||
|
||||
If this returns a permissions error, proceed to Step 3.
|
||||
|
||||
### Step 3: Delete directly from MySQL
|
||||
|
||||
When Grafana RBAC blocks API operations, go through MySQL:
|
||||
|
||||
```bash
|
||||
# Find the Grafana MySQL password
|
||||
kubectl exec -n monitoring deploy/grafana -c grafana -- \
|
||||
sh -c 'echo $GF_DATABASE_PASSWORD'
|
||||
|
||||
# Find the stale datasource
|
||||
kubectl exec -n dbaas deploy/mysql -- mysql -u grafana -p"<PASSWORD>" grafana \
|
||||
-e "SELECT id, uid, name, url FROM data_source;"
|
||||
|
||||
# Delete it
|
||||
kubectl exec -n dbaas deploy/mysql -- mysql -u grafana -p"<PASSWORD>" grafana \
|
||||
-e "DELETE FROM data_source WHERE uid='<STALE_UID>';"
|
||||
```
|
||||
|
||||
### Step 4: Fix dashboards referencing the old UID
|
||||
|
||||
Dashboards store datasource UIDs in their JSON. Update via MySQL:
|
||||
```bash
|
||||
kubectl exec -n dbaas deploy/mysql -- mysql -u grafana -p"<PASSWORD>" grafana \
|
||||
-e "UPDATE dashboard SET data = REPLACE(data, '<OLD_UID>', '<NEW_UID>') WHERE title LIKE '%Dashboard Name%';"
|
||||
```
|
||||
|
||||
### Step 5: Refresh Grafana
|
||||
|
||||
Hard-refresh browser (Cmd+Shift+R). If datasource still doesn't appear:
|
||||
```bash
|
||||
kubectl rollout restart deploy -n monitoring grafana
|
||||
```
|
||||
|
||||
## Verification
|
||||
```bash
|
||||
# Verify only correct datasources remain
|
||||
kubectl exec -n monitoring deploy/grafana -c grafana -- \
|
||||
sh -c 'curl -s "http://localhost:3000/api/datasources" \
|
||||
-u "admin:$GF_SECURITY_ADMIN_PASSWORD"' | python3 -m json.tool
|
||||
```
|
||||
|
||||
## Notes
|
||||
- Grafana's sidecar auto-discovers ConfigMaps with label `grafana_datasource: "1"`
|
||||
and provisions datasources from them. These are file-provisioned and show as
|
||||
"provisioned" in the UI.
|
||||
- Helm charts (e.g., Loki) may auto-create their own datasource in the Grafana
|
||||
database pointing to services like `loki-gateway`. If you disable the gateway,
|
||||
this datasource becomes stale.
|
||||
- Grafana dashboards in this repo are stored in MySQL (not file-provisioned),
|
||||
so dashboard JSON files in the repo are reference copies only.
|
||||
- The `GF_SECURITY_ADMIN_PASSWORD` env var is set by the Grafana Helm chart.
|
||||
- See also: `loki-helm-deployment-pitfalls` for related Loki deployment issues.
|
||||
|
|
@ -1,253 +0,0 @@
|
|||
---
|
||||
name: helm-release-troubleshooting
|
||||
description: |
|
||||
Troubleshoot and fix Helm release issues managed by Terraform. Use when:
|
||||
(1) Terraform applies successfully but K8s resources don't reflect new Helm values,
|
||||
(2) New ports/volumes/containers from Helm chart values don't appear in deployed resources,
|
||||
(3) helm upgrade --reuse-values doesn't re-render templates for structural changes,
|
||||
(4) Terraform thinks Helm release is up-to-date but actual K8s resources are stale,
|
||||
(5) terraform apply fails with "another operation (install/upgrade/rollback) is in progress",
|
||||
(6) helm history shows status "pending-upgrade" or "pending-rollback",
|
||||
(7) a Helm upgrade was interrupted by network timeout, etcd timeout, or VPN drop,
|
||||
(8) helm upgrade fails with "an error occurred while finding last successful release".
|
||||
Covers force re-rendering via state removal/reimport and stuck release recovery via
|
||||
secret cleanup.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-22
|
||||
---
|
||||
|
||||
# Helm Release Troubleshooting
|
||||
|
||||
## Force Re-render
|
||||
|
||||
### Problem
|
||||
After changing Helm chart values in a Terraform `helm_release` resource, Terraform applies
|
||||
successfully but the actual Kubernetes resources (Services, Deployments, etc.) don't reflect
|
||||
the new values. For example, adding a new port in Helm values doesn't result in that port
|
||||
appearing in the Service spec.
|
||||
|
||||
### Context / Trigger Conditions
|
||||
- Terraform `helm_release` applies with "1 changed" but `kubectl get svc -o yaml` shows
|
||||
the old configuration
|
||||
- Structural changes to Helm values (new ports, new containers, new volumes) are not
|
||||
reflected in deployed resources
|
||||
- The Helm chart templates need to be fully re-rendered, not just patched
|
||||
- Common with Traefik, ingress-nginx, and other charts where template logic conditionally
|
||||
includes resources based on values
|
||||
|
||||
### Root Cause
|
||||
Terraform's `helm_release` resource uses `helm upgrade` under the hood. When values are
|
||||
changed, Helm may use `--reuse-values` behavior where it merges new values into existing
|
||||
ones rather than doing a full template re-render. For structural changes (like enabling
|
||||
HTTP/3 which adds a new UDP port to the Service template), the templates may not be
|
||||
re-rendered with the new conditional branches active.
|
||||
|
||||
Additionally, Terraform may see the stored Helm release state as matching the desired state
|
||||
even though the actual Kubernetes resources don't reflect it, creating a state drift that
|
||||
Terraform doesn't detect.
|
||||
|
||||
### Solution
|
||||
|
||||
#### Step 1: Verify the Discrepancy
|
||||
|
||||
Confirm that K8s resources don't match Helm values:
|
||||
```bash
|
||||
# Check the actual resource
|
||||
kubectl get svc <service-name> -n <namespace> -o yaml
|
||||
|
||||
# Check what Helm thinks is deployed
|
||||
helm get values <release-name> -n <namespace>
|
||||
helm get manifest <release-name> -n <namespace> | grep -A10 "<expected-config>"
|
||||
```
|
||||
|
||||
#### Step 2: Remove Helm Release from Terraform State
|
||||
|
||||
```bash
|
||||
terraform state rm 'module.kubernetes_cluster.module.<service>.helm_release.<name>'
|
||||
```
|
||||
|
||||
**IMPORTANT**: This only removes from Terraform state. The actual Helm release and K8s
|
||||
resources remain untouched in the cluster.
|
||||
|
||||
#### Step 3: Import the Helm Release Back
|
||||
|
||||
```bash
|
||||
terraform import 'module.kubernetes_cluster.module.<service>.helm_release.<name>' '<namespace>/<release-name>'
|
||||
```
|
||||
|
||||
For Helm releases, the import ID format is `namespace/release-name`.
|
||||
|
||||
#### Step 4: Force Apply with Terraform
|
||||
|
||||
After reimporting, run terraform apply. Terraform should now detect the drift between
|
||||
the desired Helm values and the actual release state:
|
||||
|
||||
```bash
|
||||
terraform apply -target=module.kubernetes_cluster.module.<service>
|
||||
```
|
||||
|
||||
If Terraform still shows "no changes", you may need to taint the resource:
|
||||
```bash
|
||||
terraform taint 'module.kubernetes_cluster.module.<service>.helm_release.<name>'
|
||||
terraform apply -target=module.kubernetes_cluster.module.<service>
|
||||
```
|
||||
|
||||
#### Step 5: Manual Helm Force Upgrade (Last Resort)
|
||||
|
||||
If Terraform still doesn't fix it, use Helm directly as a one-time fix, then reimport:
|
||||
|
||||
```bash
|
||||
# Get the current values file
|
||||
helm get values <release-name> -n <namespace> -o yaml > /tmp/values.yaml
|
||||
|
||||
# Edit /tmp/values.yaml to include the correct values, or use --set flags
|
||||
|
||||
# Force upgrade (re-renders all templates)
|
||||
helm upgrade --force <release-name> <chart> -n <namespace> -f /tmp/values.yaml
|
||||
|
||||
# Then reimport into Terraform
|
||||
terraform state rm 'module.kubernetes_cluster.module.<service>.helm_release.<name>'
|
||||
terraform import 'module.kubernetes_cluster.module.<service>.helm_release.<name>' '<namespace>/<release-name>'
|
||||
terraform apply -target=module.kubernetes_cluster.module.<service>
|
||||
```
|
||||
|
||||
**WARNING**: Direct Helm operations bypass Terraform. Always reimport into Terraform state
|
||||
afterward, and use `terraform apply` to verify Terraform is back in sync.
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Check the K8s resources now match expected configuration
|
||||
kubectl get svc <service-name> -n <namespace> -o yaml
|
||||
kubectl get deployment <deployment-name> -n <namespace> -o yaml
|
||||
|
||||
# Verify Terraform is in sync
|
||||
terraform plan -target=module.kubernetes_cluster.module.<service>
|
||||
# Should show "No changes" or minimal expected drift
|
||||
```
|
||||
|
||||
### Example: Traefik HTTP/3 UDP Port Not Appearing
|
||||
|
||||
**Problem**: Added `http3.enabled=true` to Traefik Helm values. Terraform applied
|
||||
successfully, but the Traefik Service only had TCP port 443, missing the expected
|
||||
UDP port 443 (`websecure-http3`).
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# 1. Remove from state
|
||||
terraform state rm 'module.kubernetes_cluster.module.traefik.helm_release.traefik'
|
||||
|
||||
# 2. Reimport
|
||||
terraform import 'module.kubernetes_cluster.module.traefik.helm_release.traefik' 'traefik/traefik'
|
||||
|
||||
# 3. Apply (Terraform now detects the drift)
|
||||
terraform apply -target=module.kubernetes_cluster.module.traefik
|
||||
|
||||
# 4. Verify
|
||||
kubectl get svc traefik -n traefik -o yaml | grep -A3 "websecure-http3"
|
||||
# Should show: port: 443, protocol: UDP
|
||||
```
|
||||
|
||||
### Notes
|
||||
|
||||
- This issue is more common with structural Helm value changes (new ports, new sidecars,
|
||||
conditional template blocks) than with simple value changes (image tags, replica counts)
|
||||
- The `helm upgrade --force` flag deletes and recreates resources that have changed,
|
||||
which causes brief downtime. Use with caution on production ingress controllers.
|
||||
- Always verify with `terraform plan` after fixing to ensure Terraform state is consistent
|
||||
|
||||
---
|
||||
|
||||
## Stuck Release Recovery
|
||||
|
||||
### Problem
|
||||
Helm releases can get stuck in `pending-upgrade`, `pending-rollback`, or `pending-install`
|
||||
states when an upgrade is interrupted (network drop, etcd timeout, resource exhaustion).
|
||||
Subsequent upgrades or terraform applies fail because Helm thinks an operation is in progress.
|
||||
|
||||
### Context / Trigger Conditions
|
||||
- `terraform apply` fails with: `another operation (install/upgrade/rollback) is in progress`
|
||||
- `helm history <release> -n <namespace>` shows `pending-upgrade`, `pending-rollback`, or `pending-install`
|
||||
- A previous Helm upgrade was interrupted by network timeout, VPN drop, or etcd timeout
|
||||
- `helm upgrade` fails with: `an error occurred while finding last successful release`
|
||||
|
||||
### Solution
|
||||
|
||||
#### Step 1: Identify the stuck release
|
||||
```bash
|
||||
helm --kubeconfig $(pwd)/config history <release> -n <namespace> | tail -5
|
||||
```
|
||||
|
||||
Look for revisions with status `pending-upgrade`, `pending-rollback`, or `pending-install`.
|
||||
|
||||
#### Step 2: Delete the stuck Helm release secrets
|
||||
Each Helm revision is stored as a Kubernetes secret named `sh.helm.release.v1.<release>.v<revision>`.
|
||||
Delete all stuck revisions:
|
||||
|
||||
```bash
|
||||
# Delete specific stuck revision (e.g., revision 5)
|
||||
kubectl --kubeconfig $(pwd)/config delete secret sh.helm.release.v1.<release>.v5 -n <namespace>
|
||||
|
||||
# If multiple stuck revisions exist, delete all of them
|
||||
kubectl --kubeconfig $(pwd)/config delete secret sh.helm.release.v1.<release>.v6 -n <namespace>
|
||||
```
|
||||
|
||||
#### Step 3: Verify the release is clean
|
||||
```bash
|
||||
helm --kubeconfig $(pwd)/config history <release> -n <namespace> | tail -3
|
||||
```
|
||||
|
||||
The latest revision should now show `deployed` status.
|
||||
|
||||
#### Step 4: Retry the upgrade
|
||||
```bash
|
||||
terraform apply -target=module.kubernetes_cluster.module.<service> -var="kube_config_path=$(pwd)/config" -auto-approve
|
||||
```
|
||||
|
||||
### Important Notes
|
||||
|
||||
- **Never patch the secret labels** (e.g., changing `status: pending-rollback` to `status: failed`).
|
||||
This changes the label but not the encoded release data inside the secret, leaving Helm in an
|
||||
inconsistent state. Always delete the stuck secrets entirely.
|
||||
- If the failed upgrade partially applied changes to the cluster (e.g., modified a Deployment),
|
||||
the next successful upgrade will reconcile the state.
|
||||
- When VPN/network is unstable, prefer direct `helm upgrade --reuse-values --set key=value`
|
||||
over `terraform apply`, since Helm upgrades are faster than the full Terraform refresh cycle.
|
||||
|
||||
### Verification
|
||||
After deleting stuck secrets and re-applying:
|
||||
- `helm history` shows the new revision as `deployed`
|
||||
- `terraform apply` completes without errors
|
||||
|
||||
### Example
|
||||
```bash
|
||||
# Helm history shows stuck state
|
||||
$ helm history nextcloud -n nextcloud | tail -3
|
||||
4 deployed nextcloud-8.8.1 Upgrade complete
|
||||
5 failed nextcloud-8.8.1 Upgrade failed: etcd timeout
|
||||
6 pending-rollback nextcloud-8.8.1 Rollback to 4
|
||||
|
||||
# Fix: delete stuck revisions
|
||||
$ kubectl delete secret sh.helm.release.v1.nextcloud.v5 sh.helm.release.v1.nextcloud.v6 -n nextcloud
|
||||
|
||||
# Verify clean state
|
||||
$ helm history nextcloud -n nextcloud | tail -1
|
||||
4 deployed nextcloud-8.8.1 Upgrade complete
|
||||
|
||||
# Re-apply
|
||||
$ terraform apply -target=module.kubernetes_cluster.module.nextcloud -auto-approve
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## See Also
|
||||
|
||||
- `terraform-state-identity-mismatch` - For Terraform provider identity errors
|
||||
- `traefik-http3-quic` - For enabling HTTP/3 on Traefik (common trigger for force re-render)
|
||||
|
||||
## References
|
||||
|
||||
- [Terraform helm_release Resource](https://registry.terraform.io/providers/hashicorp/helm/latest/docs/resources/release)
|
||||
- [Helm Upgrade Documentation](https://helm.sh/docs/helm/helm_upgrade/)
|
||||
- [Helm --force Flag](https://helm.sh/docs/helm/helm_upgrade/#options)
|
||||
|
|
@ -1,157 +0,0 @@
|
|||
---
|
||||
name: ingress-factory-migration
|
||||
description: |
|
||||
Migrate raw kubernetes_ingress_v1 resources to the centralized ingress_factory module.
|
||||
Use when: (1) a service defines a raw kubernetes_ingress_v1 with hand-rolled Traefik
|
||||
middleware annotations, (2) adding a new service that needs standard ingress with
|
||||
rate limiting, CrowdSec, CSP headers, rybbit analytics, or authentik auth,
|
||||
(3) refactoring existing ingresses for consistency. Covers single-path, multi-path,
|
||||
split UI/API, full_host overrides, custom rate limits, and extra middleware injection.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-10
|
||||
---
|
||||
|
||||
# Ingress Factory Migration
|
||||
|
||||
## Problem
|
||||
Services define raw `kubernetes_ingress_v1` resources with hand-rolled Traefik middleware
|
||||
chains. This creates inconsistency - middleware chains are copy-pasted per service, making
|
||||
it easy to miss security middleware (CrowdSec, rate limiting) or analytics (rybbit). The
|
||||
`ingress_factory` module at `modules/kubernetes/ingress_factory/main.tf` provides a single
|
||||
point of control.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Service has a raw `kubernetes_ingress_v1` resource instead of using `module "ingress"`
|
||||
- Service has a manually defined `kubernetes_manifest` for rybbit analytics middleware
|
||||
- New service needs standard ingress configuration
|
||||
- Middleware chain needs to be updated across many services
|
||||
|
||||
## Solution
|
||||
|
||||
### Standard single-path ingress
|
||||
Replace the raw resource with:
|
||||
```hcl
|
||||
module "ingress" {
|
||||
source = "../ingress_factory"
|
||||
namespace = kubernetes_namespace.<service>.metadata[0].name
|
||||
name = "<service-name>" # becomes the ingress name AND default hostname
|
||||
host = "<subdomain>" # optional: override hostname (if different from name)
|
||||
service_name = "<k8s-service-name>" # optional: defaults to name
|
||||
port = 80 # optional: defaults to 80
|
||||
tls_secret_name = var.tls_secret_name
|
||||
protected = false # set true for authentik forward auth
|
||||
}
|
||||
```
|
||||
|
||||
### Multi-path / split UI+API
|
||||
Use two module calls with different names but same host:
|
||||
```hcl
|
||||
module "ingress" {
|
||||
source = "../ingress_factory"
|
||||
namespace = kubernetes_namespace.<service>.metadata[0].name
|
||||
name = "<service>"
|
||||
host = "<subdomain>"
|
||||
service_name = "<ui-service>"
|
||||
tls_secret_name = var.tls_secret_name
|
||||
rybbit_site_id = "<id>" # optional: adds rybbit analytics
|
||||
}
|
||||
|
||||
module "ingress-api" {
|
||||
source = "../ingress_factory"
|
||||
namespace = kubernetes_namespace.<service>.metadata[0].name
|
||||
name = "<service>-api"
|
||||
host = "<subdomain>" # same host as UI
|
||||
service_name = "<api-service>"
|
||||
ingress_path = ["/api"]
|
||||
tls_secret_name = var.tls_secret_name
|
||||
# No rybbit_site_id - API returns JSON, not HTML
|
||||
}
|
||||
```
|
||||
|
||||
### Full host override (for root domain like viktorbarzin.me)
|
||||
```hcl
|
||||
module "ingress" {
|
||||
source = "../ingress_factory"
|
||||
namespace = kubernetes_namespace.<service>.metadata[0].name
|
||||
name = "<service>"
|
||||
service_name = "<k8s-service>"
|
||||
full_host = "viktorbarzin.me" # bypasses name.root_domain construction
|
||||
tls_secret_name = var.tls_secret_name
|
||||
}
|
||||
```
|
||||
|
||||
### Custom rate limiting (e.g., immich)
|
||||
```hcl
|
||||
module "ingress" {
|
||||
source = "../ingress_factory"
|
||||
namespace = kubernetes_namespace.<service>.metadata[0].name
|
||||
name = "<service>"
|
||||
skip_default_rate_limit = true
|
||||
extra_middlewares = ["traefik-<custom>-rate-limit@kubernetescrd"]
|
||||
tls_secret_name = var.tls_secret_name
|
||||
}
|
||||
```
|
||||
|
||||
### Key variables reference
|
||||
| Variable | Default | Purpose |
|
||||
|----------|---------|---------|
|
||||
| `name` | required | Ingress resource name + default hostname |
|
||||
| `host` | null | Override hostname prefix (name used if null) |
|
||||
| `full_host` | null | Override entire hostname (bypasses root_domain) |
|
||||
| `service_name` | null | K8s service name (name used if null) |
|
||||
| `port` | 80 | Backend service port |
|
||||
| `ingress_path` | ["/"] | URL paths to match |
|
||||
| `protected` | false | Adds authentik forward auth middleware |
|
||||
| `rybbit_site_id` | null | Adds rybbit analytics script injection |
|
||||
| `skip_default_rate_limit` | false | Omits default rate limiter |
|
||||
| `extra_middlewares` | [] | Additional middleware references to append |
|
||||
| `extra_annotations` | {} | Additional ingress annotations |
|
||||
| `allow_local_access_only` | false | Restricts to LAN/VPN |
|
||||
| `exclude_crowdsec` | false | Skips CrowdSec middleware |
|
||||
| `custom_content_security_policy` | null | Custom CSP header |
|
||||
|
||||
### After migration, delete:
|
||||
1. The raw `kubernetes_ingress_v1` resource
|
||||
2. Any manually defined `kubernetes_manifest "rybbit_analytics"` (the factory creates this automatically when `rybbit_site_id` is set)
|
||||
|
||||
## Gotchas
|
||||
|
||||
### Duplicate module names
|
||||
If the service directory has multiple `.tf` files (e.g., `main.tf` and `frame.tf`), check
|
||||
for existing `module "ingress"` blocks. Module names must be unique within a directory.
|
||||
Use a descriptive name like `module "ingress-immich"` instead.
|
||||
|
||||
### Terraform target module names with hyphens
|
||||
Module names in `terraform state list` may use hyphens (e.g., `module.real-estate-crawler`).
|
||||
When using `-target`, you must match the exact name including hyphens:
|
||||
```bash
|
||||
# Wrong - underscores:
|
||||
terraform apply -target=module.kubernetes_cluster.module.real_estate_crawler
|
||||
|
||||
# Correct - hyphens (quote to prevent shell interpretation):
|
||||
terraform apply '-target=module.kubernetes_cluster.module.real-estate-crawler'
|
||||
```
|
||||
|
||||
### Service name defaults
|
||||
The factory defaults `service_name` to `name`. If the K8s service has a different name
|
||||
than the ingress, you must explicitly set `service_name`. Common case: headscale has one
|
||||
K8s service named `headscale` with multiple ports, so the UI ingress needs
|
||||
`service_name = "headscale"` even though `name = "headscale-ui"`.
|
||||
|
||||
### Servarr subdirectory source path
|
||||
Services under `servarr/` need `../../ingress_factory` as the source path instead of
|
||||
`../ingress_factory`.
|
||||
|
||||
## Verification
|
||||
1. `terraform validate` - check for syntax errors
|
||||
2. `terraform plan -target=module.kubernetes_cluster.module.<service>` - verify old ingress destroyed, new created
|
||||
3. `kubectl get ingress -n <namespace>` - verify ingress exists with correct host/paths
|
||||
4. Browse the service URL to confirm accessibility
|
||||
|
||||
## Notes
|
||||
- Services using special protocols (gRPC, mTLS, WebSocket with custom headers) should NOT
|
||||
be migrated - keep raw `kubernetes_ingress_v1` for those
|
||||
- The factory automatically includes: rate-limit, CSP headers, CrowdSec, and entrypoint=websecure
|
||||
- When `rybbit_site_id` is set, the factory creates a `kubernetes_manifest` for the
|
||||
rewrite-body middleware that injects the analytics script into HTML responses
|
||||
|
|
@ -1,80 +0,0 @@
|
|||
---
|
||||
name: iterative-plan-review-with-subagents
|
||||
description: |
|
||||
Design pattern for reviewing implementation plans using parallel subagent reviewers
|
||||
with iterative refinement. Use when: (1) designing a complex infrastructure change
|
||||
that needs security + implementation review, (2) creating a migration plan with
|
||||
multiple phases, (3) any plan where missing a critical issue could cause data loss
|
||||
or security exposure. Spawns 2 reviewer agents (security + implementation), collects
|
||||
CRITICAL/IMPORTANT/NIT findings, fixes all CRITICALs, re-runs until zero CRITICALs.
|
||||
Typically converges in 2-3 iterations.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-03-07
|
||||
---
|
||||
|
||||
# Iterative Plan Review with Subagents
|
||||
|
||||
## Problem
|
||||
Complex infrastructure plans have blind spots — security issues, implementation
|
||||
incompatibilities, race conditions, format mismatches. A single reviewer misses things.
|
||||
Multiple reviewers with different expertise catch more.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Writing a migration plan (e.g., secrets management, storage migration)
|
||||
- Designing a multi-phase infrastructure change
|
||||
- Any plan where a missed issue = downtime, data loss, or security exposure
|
||||
- User explicitly asks for plan review
|
||||
|
||||
## Solution
|
||||
|
||||
### 1. Write the plan as a markdown document
|
||||
Save to `docs/plans/YYYY-MM-DD-<topic>.md`
|
||||
|
||||
### 2. Spawn 2 reviewer agents in parallel
|
||||
```
|
||||
Agent 1: Security reviewer
|
||||
- Focus: secret exposure, access control, key management, CI pipeline security
|
||||
- Classify each finding: CRITICAL / IMPORTANT / NIT
|
||||
|
||||
Agent 2: Implementation reviewer
|
||||
- Focus: format compatibility, race conditions, ordering, tool behavior
|
||||
- Classify each finding: CRITICAL / IMPORTANT / NIT
|
||||
```
|
||||
|
||||
Key: give each reviewer specific focus areas and the actual source code to check against.
|
||||
|
||||
### 3. Consolidate and fix CRITICALs
|
||||
- Merge findings from both reviewers
|
||||
- Deduplicate (both often find the same issue)
|
||||
- Fix ALL CRITICALs in the plan document
|
||||
- Note IMPORTANTs for implementation phase
|
||||
|
||||
### 4. Re-run reviewers on the updated plan
|
||||
- Same 2 agents, but tell them which CRITICALs were fixed
|
||||
- Ask them to VERIFY fixes are correct AND find new issues
|
||||
- Repeat until zero CRITICALs
|
||||
|
||||
### 5. Typical convergence
|
||||
- v1: 5-6 CRITICALs (format issues, race conditions, missing steps)
|
||||
- v2: 2-3 CRITICALs (fixes introduced new issues, missed edge cases)
|
||||
- v3: 0 CRITICALs, only IMPORTANTs remaining
|
||||
|
||||
## Example Findings from Real Usage (SOPS migration)
|
||||
|
||||
| Iteration | CRITICALs Found | Examples |
|
||||
|-----------|----------------|---------|
|
||||
| v1 | 6 | YAML≠HCL format, `git add .` commits secrets, no branch protection, parallel race condition |
|
||||
| v2 | 3 | `SOPS_AGE_KEY_FILE` misunderstanding, `renew-tls.yml` not updated, plan leaks in PR logs |
|
||||
| v3 | 0 | All verified fixed. 6 IMPORTANTs noted for implementation. |
|
||||
|
||||
## Verification
|
||||
- Zero CRITICALs from both reviewers on the final iteration
|
||||
- IMPORTANTs documented as implementation notes (not blockers)
|
||||
|
||||
## Notes
|
||||
- Use `sonnet` model for reviewers (fast, thorough enough for review)
|
||||
- Give reviewers actual source code paths to read, not just the plan
|
||||
- Tell v2+ reviewers what was fixed so they verify, not re-discover
|
||||
- The final review should say "ONLY report CRITICALs" to avoid noise
|
||||
- This pattern cost ~$3-5 in API calls but caught issues that would have caused hours of debugging
|
||||
|
|
@ -1,244 +0,0 @@
|
|||
---
|
||||
name: k8s-container-image-caching
|
||||
description: |
|
||||
Set up and troubleshoot container image pull-through caches in Kubernetes. Use when:
|
||||
(1) ImagePullBackOff for non-Docker-Hub images routed through a wildcard mirror,
|
||||
(2) containerd has deprecated `registry.mirrors."*"` catching all image pulls,
|
||||
(3) need to add pull-through cache for a new upstream registry,
|
||||
(4) `mirrors` cannot be set when `config_path` is provided error in containerd,
|
||||
(5) containerd 1.6.x vs 1.7.x config_path compatibility issues,
|
||||
(6) kubectl shows correct image tag but container runs old code,
|
||||
(7) local registry mirror caches stale images,
|
||||
(8) imagePullPolicy: Always doesn't force fresh pulls,
|
||||
(9) containerd config has mirror that intercepts pulls serving stale images.
|
||||
Covers multi-registry pull-through cache setup (Docker Registry v2) and cache bypass
|
||||
via image digest pinning.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-22
|
||||
---
|
||||
|
||||
# Kubernetes Container Image Caching
|
||||
|
||||
## Pull-Through Cache Setup
|
||||
|
||||
### Problem
|
||||
|
||||
Docker Registry v2 can only proxy **one upstream registry per instance**. A common
|
||||
misconfiguration is using a containerd wildcard mirror (`registry.mirrors."*"`) pointing
|
||||
to a single Docker Hub proxy, which breaks pulls from ghcr.io, quay.io, registry.k8s.io,
|
||||
and other registries -- they get routed to the Docker Hub proxy which can't serve them,
|
||||
causing `ImagePullBackOff`.
|
||||
|
||||
### Context / Trigger Conditions
|
||||
|
||||
- `ImagePullBackOff` for images from ghcr.io, quay.io, registry.k8s.io, or other non-Docker-Hub registries
|
||||
- Containerd config has deprecated `[plugins."io.containerd.grpc.v1.cri".registry.mirrors."*"]`
|
||||
- Error: `failed to load plugin io.containerd.grpc.v1.cri: invalid plugin config: mirrors cannot be set when config_path is provided`
|
||||
- Need to migrate from deprecated wildcard mirrors to modern `config_path` approach
|
||||
|
||||
### Solution
|
||||
|
||||
#### 1. Run one Registry v2 container per upstream
|
||||
|
||||
Each upstream needs its own Docker Registry v2 instance on a different port:
|
||||
|
||||
| Port | Registry | Container Name |
|
||||
|------|----------|---------------|
|
||||
| 5000 | docker.io | registry |
|
||||
| 5010 | ghcr.io | registry-ghcr |
|
||||
| 5020 | quay.io | registry-quay |
|
||||
| 5030 | registry.k8s.io | registry-k8s |
|
||||
| 5040 | reg.kyverno.io | registry-kyverno |
|
||||
|
||||
Config for non-Docker-Hub proxies (no auth needed -- they're public):
|
||||
|
||||
```yaml
|
||||
version: 0.1
|
||||
storage:
|
||||
cache:
|
||||
blobdescriptor: inmemory
|
||||
filesystem:
|
||||
rootdirectory: /var/lib/registry
|
||||
http:
|
||||
addr: :5000
|
||||
proxy:
|
||||
remoteurl: https://ghcr.io # change per registry
|
||||
```
|
||||
|
||||
```bash
|
||||
docker run -p 5010:5000 -d --restart always --name registry-ghcr \
|
||||
-v /etc/docker-registry/ghcr/config.yml:/etc/docker/registry/config.yml registry:2
|
||||
```
|
||||
|
||||
#### 2. Replace deprecated wildcard mirror with `config_path`
|
||||
|
||||
Instead of:
|
||||
```toml
|
||||
# DEPRECATED - breaks non-Docker-Hub registries
|
||||
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."*"]
|
||||
endpoint = ["http://10.0.20.10:5000"]
|
||||
```
|
||||
|
||||
Use the modern `config_path` approach:
|
||||
```toml
|
||||
[plugins."io.containerd.grpc.v1.cri".registry]
|
||||
config_path = "/etc/containerd/certs.d"
|
||||
```
|
||||
|
||||
Then create per-registry `hosts.toml` files:
|
||||
```bash
|
||||
mkdir -p /etc/containerd/certs.d/docker.io
|
||||
cat > /etc/containerd/certs.d/docker.io/hosts.toml <<'EOF'
|
||||
server = "https://registry-1.docker.io"
|
||||
|
||||
[host."http://10.0.20.10:5000"]
|
||||
capabilities = ["pull", "resolve"]
|
||||
EOF
|
||||
```
|
||||
|
||||
Registries without a `hosts.toml` entry **fall through to direct pull** (no breakage).
|
||||
|
||||
#### 3. Critical: `config_path` and `mirrors` cannot coexist
|
||||
|
||||
Containerd will **refuse to start the CRI plugin** if both `config_path` and any
|
||||
`mirrors` entries exist in `config.toml`. You must remove ALL `mirrors` entries
|
||||
(including the `[plugins."...registry.mirrors"]` parent section) before setting
|
||||
`config_path`.
|
||||
|
||||
This is especially dangerous on containerd 1.6.x (used on older nodes like k8s-master)
|
||||
where the config format is slightly different. If unsure, either:
|
||||
- Don't use config_path on that node (skip the pull-through cache)
|
||||
- Remove the entire `mirrors` section first, then add `config_path`
|
||||
|
||||
#### 4. Static IP for registry VM
|
||||
|
||||
If the registry VM uses DHCP and gets the wrong IP, all mirrors break. Use static IP
|
||||
via cloud-init `ipconfig0 = "ip=10.0.20.10/24,gw=10.0.20.1"` instead of DHCP.
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Test each proxy responds
|
||||
for port in 5000 5010 5020 5030 5040; do
|
||||
curl -s http://10.0.20.10:$port/v2/_catalog
|
||||
done
|
||||
|
||||
# Test containerd can pull through cache
|
||||
crictl pull ghcr.io/some/image:tag
|
||||
|
||||
# Check containerd logs for mirror usage
|
||||
journalctl -u containerd --since "5 minutes ago" | grep -i "mirror\|registry"
|
||||
```
|
||||
|
||||
### Notes
|
||||
|
||||
- **Fallback behavior**: If the local mirror is unreachable, containerd falls through to
|
||||
direct pull from the upstream `server` URL. This provides graceful degradation.
|
||||
- **GC crontabs**: Add weekly garbage collection for each registry container, staggered
|
||||
to avoid I/O spikes.
|
||||
- **Hourly restart**: Registry v2 has known memory leak issues; hourly restart mitigates.
|
||||
- **Cache is ephemeral**: VM recreation clears the cache. Images re-cache on demand.
|
||||
|
||||
---
|
||||
|
||||
## Cache Bypass / Stale Image Fix
|
||||
|
||||
### Problem
|
||||
Kubernetes pods continue running old Docker images even after pushing new versions with
|
||||
the same tag (e.g., `:latest`). This happens when a local registry mirror caches images
|
||||
and serves stale versions, ignoring `imagePullPolicy: Always`.
|
||||
|
||||
### Context / Trigger Conditions
|
||||
- Pod is running but application code is outdated
|
||||
- `docker push` succeeded with new layers
|
||||
- `kubectl describe pod` shows correct image tag
|
||||
- Cluster has a local registry mirror configured (e.g., in containerd config)
|
||||
- `imagePullPolicy: Always` doesn't fix the issue
|
||||
- Nodes configured with registry mirrors at `/etc/containerd/certs.d/` or similar
|
||||
|
||||
### Solution
|
||||
|
||||
#### 1. Get the image digest after pushing
|
||||
```bash
|
||||
docker push viktorbarzin/myimage:latest
|
||||
# Output includes: latest: digest: sha256:abc123... size: 856
|
||||
```
|
||||
|
||||
#### 2. Use digest instead of tag in deployment
|
||||
```hcl
|
||||
# Terraform
|
||||
container {
|
||||
# Use digest to bypass local registry cache
|
||||
image = "docker.io/viktorbarzin/myimage@sha256:abc123..."
|
||||
image_pull_policy = "Always"
|
||||
name = "myimage"
|
||||
}
|
||||
```
|
||||
|
||||
```yaml
|
||||
# Kubernetes YAML
|
||||
containers:
|
||||
- name: myimage
|
||||
image: docker.io/viktorbarzin/myimage@sha256:abc123...
|
||||
imagePullPolicy: Always
|
||||
```
|
||||
|
||||
#### 3. Apply and restart
|
||||
```bash
|
||||
terraform apply -target=module.kubernetes_cluster.module.myservice
|
||||
kubectl rollout restart deployment/myservice -n mynamespace
|
||||
```
|
||||
|
||||
### Why This Works
|
||||
- Registry mirrors match by tag, not digest
|
||||
- When you specify a digest, the node must fetch that exact manifest
|
||||
- The mirror may not have the digest cached, forcing a pull from upstream
|
||||
- Even if cached, the digest guarantees the exact image version
|
||||
|
||||
### Verification
|
||||
```bash
|
||||
# Check the pod is using the new image
|
||||
kubectl get pod -n mynamespace -o jsonpath='{.items[*].spec.containers[*].image}'
|
||||
|
||||
# Verify application behavior reflects new code
|
||||
kubectl exec -n mynamespace deploy/myservice -- <verification-command>
|
||||
```
|
||||
|
||||
### Example
|
||||
|
||||
Before (problematic):
|
||||
```hcl
|
||||
image = "docker.io/viktorbarzin/audiblez-web:latest"
|
||||
```
|
||||
|
||||
After (fixed):
|
||||
```hcl
|
||||
image = "docker.io/viktorbarzin/audiblez-web@sha256:4d0e2c839555e2229bc91a0b1273569bac88529e8b3c3cadad3c3cf9d865fa29"
|
||||
```
|
||||
|
||||
### Notes
|
||||
- You must update the digest each time you push a new image
|
||||
- Consider automating digest extraction in CI/CD pipelines
|
||||
- This is a workaround; ideally fix the registry mirror configuration
|
||||
- To find your registry mirror config: `cat /etc/containerd/config.toml` on nodes
|
||||
- Common mirror locations: `/etc/containerd/certs.d/docker.io/hosts.toml`
|
||||
|
||||
### Diagnosing Registry Mirror Issues
|
||||
```bash
|
||||
# On a k8s node, check containerd config
|
||||
cat /etc/containerd/config.toml | grep -A5 mirrors
|
||||
|
||||
# Check if mirror is intercepting
|
||||
crictl pull docker.io/library/alpine:latest --debug 2>&1 | grep -i mirror
|
||||
|
||||
# List cached images on node
|
||||
crictl images | grep myimage
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## References
|
||||
|
||||
- [Kubernetes imagePullPolicy documentation](https://kubernetes.io/docs/concepts/containers/images/#image-pull-policy)
|
||||
- [containerd registry configuration](https://github.com/containerd/containerd/blob/main/docs/hosts.md)
|
||||
|
|
@ -1,186 +0,0 @@
|
|||
---
|
||||
name: k8s-gpu-no-nvidia-devices
|
||||
description: |
|
||||
Fix for Kubernetes GPU pods showing "CUDA not supported" or no /dev/nvidia* devices
|
||||
despite nvidia.com/gpu resource allocation. Use when: (1) container runs but torch.cuda.is_available()
|
||||
returns False, (2) ls /dev/nvidia* shows "no matches found", (3) nvidia-smi fails inside pod
|
||||
but works on host, (4) PyTorch/TensorFlow falls back to CPU despite GPU allocation.
|
||||
Covers NVIDIA device plugin, time-slicing, and container runtime issues.
|
||||
author: Claude Code
|
||||
version: 1.1.0
|
||||
date: 2026-03-01
|
||||
---
|
||||
|
||||
# Kubernetes GPU Pod - No NVIDIA Devices Found
|
||||
|
||||
## Problem
|
||||
|
||||
A Kubernetes pod requests GPU resources (`nvidia.com/gpu: 1`) and schedules on a GPU node,
|
||||
but inside the container there are no NVIDIA devices visible. The application falls back
|
||||
to CPU with messages like "CUDA not supported by the Torch installed!" despite running
|
||||
in a CUDA-enabled container image.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
|
||||
- Pod shows `Running` status and is on a node with `gpu=true` label
|
||||
- `kubectl describe pod` shows GPU limit/request is satisfied
|
||||
- Inside container: `ls /dev/nvidia*` returns "no matches found"
|
||||
- Inside container: `nvidia-smi` fails or command not found
|
||||
- Application logs show: "CUDA not supported", "Switching to CPU", "torch.cuda.is_available() = False"
|
||||
- On the host node: `nvidia-smi` works fine
|
||||
|
||||
## Solution
|
||||
|
||||
### Step 1: Verify GPU Availability
|
||||
|
||||
Check if other pods are consuming the GPU:
|
||||
|
||||
```bash
|
||||
# List all pods using GPU resources
|
||||
kubectl get pods -A -o json | jq -r '.items[] | select(.spec.containers[].resources.limits."nvidia.com/gpu" != null) | "\(.metadata.namespace)/\(.metadata.name)"'
|
||||
|
||||
# Check NVIDIA device plugin pods
|
||||
kubectl get pods -n nvidia -l app=nvidia-device-plugin
|
||||
kubectl logs -n nvidia -l app=nvidia-device-plugin --tail=50
|
||||
```
|
||||
|
||||
### Step 2: Free GPU Resources
|
||||
|
||||
If another workload is using the GPU, unload it:
|
||||
|
||||
```bash
|
||||
# For Ollama specifically
|
||||
kubectl exec -n ollama deployment/ollama -- ollama stop <model_name>
|
||||
|
||||
# Or scale down the conflicting deployment
|
||||
kubectl scale deployment/<name> -n <namespace> --replicas=0
|
||||
```
|
||||
|
||||
### Step 3: Restart the Affected Pod
|
||||
|
||||
After freeing GPU resources, restart the pod to get fresh device allocation:
|
||||
|
||||
```bash
|
||||
kubectl rollout restart deployment/<name> -n <namespace>
|
||||
|
||||
# Or delete the pod directly
|
||||
kubectl delete pod <pod-name> -n <namespace>
|
||||
```
|
||||
|
||||
### Step 4: Verify GPU Access
|
||||
|
||||
```bash
|
||||
# Check devices are now visible
|
||||
kubectl exec -n <namespace> deployment/<name> -- ls -la /dev/nvidia*
|
||||
|
||||
# Test nvidia-smi
|
||||
kubectl exec -n <namespace> deployment/<name> -- nvidia-smi
|
||||
|
||||
# Test PyTorch CUDA
|
||||
kubectl exec -n <namespace> deployment/<name> -- python3 -c "import torch; print('CUDA:', torch.cuda.is_available())"
|
||||
```
|
||||
|
||||
## Verification
|
||||
|
||||
After restart, you should see:
|
||||
|
||||
```
|
||||
/dev/nvidia0
|
||||
/dev/nvidiactl
|
||||
/dev/nvidia-uvm
|
||||
/dev/nvidia-uvm-tools
|
||||
```
|
||||
|
||||
And `nvidia-smi` should show the GPU with your container process.
|
||||
|
||||
## Example
|
||||
|
||||
```bash
|
||||
# Problem: ebook2audiobook shows "CUDA not supported"
|
||||
$ kubectl exec -n ebook2audiobook deployment/ebook2audiobook -- ls /dev/nvidia*
|
||||
zsh:1: no matches found: /dev/nvidia*
|
||||
|
||||
# Solution: Unload Ollama model holding the GPU
|
||||
$ kubectl exec -n ollama deployment/ollama -- ollama ps
|
||||
NAME SIZE PROCESSOR
|
||||
qwen2.5:14b 10 GB 33%/67% CPU/GPU
|
||||
|
||||
$ kubectl exec -n ollama deployment/ollama -- ollama stop qwen2.5:14b
|
||||
|
||||
# Restart the affected pod
|
||||
$ kubectl rollout restart deployment/ebook2audiobook -n ebook2audiobook
|
||||
|
||||
# Verify
|
||||
$ kubectl exec -n ebook2audiobook deployment/ebook2audiobook -- nvidia-smi
|
||||
# Should now show the Tesla T4 GPU
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
- **GPU Time-Slicing**: If using NVIDIA GPU time-slicing (configured in GPU Operator),
|
||||
multiple pods can share a GPU. However, device injection still requires proper timing.
|
||||
|
||||
- **Pod Scheduling Order**: Pods that start while GPU is fully allocated may not get
|
||||
devices injected even after GPU becomes available - a restart is required.
|
||||
|
||||
- **Container Runtime**: The NVIDIA Container Toolkit must be properly configured.
|
||||
Issues can arise from:
|
||||
- cgroup driver mismatch (systemd vs cgroupfs)
|
||||
- Container updates causing device loss
|
||||
- SELinux blocking device access
|
||||
|
||||
- **Image Compatibility**: The container image must have CUDA libraries matching the
|
||||
driver version. Check with `nvidia-smi` on host for driver version.
|
||||
|
||||
- **This Cluster**: Uses NVIDIA GPU Operator with time-slicing (20 replicas per GPU).
|
||||
GPU node is `k8s-node1` with Tesla T4.
|
||||
|
||||
## See Also
|
||||
|
||||
- Check GPU Operator status: `kubectl get pods -n nvidia`
|
||||
- View time-slicing config: `kubectl get configmap -n nvidia time-slicing-config -o yaml`
|
||||
|
||||
## Automatic GPU Recovery via Liveness Probe
|
||||
|
||||
To prevent GPU loss from requiring manual intervention, add a liveness probe that checks
|
||||
both GPU availability and application health. Example for Frigate (but applicable to any
|
||||
GPU workload):
|
||||
|
||||
```hcl
|
||||
# Restart pod if GPU becomes unavailable or app hangs
|
||||
liveness_probe {
|
||||
exec {
|
||||
command = ["sh", "-c", "nvidia-smi > /dev/null 2>&1 && curl -sf http://localhost:<port>/health > /dev/null"]
|
||||
}
|
||||
initial_delay_seconds = 120
|
||||
period_seconds = 60
|
||||
timeout_seconds = 10
|
||||
failure_threshold = 3
|
||||
}
|
||||
# Allow time for GPU model loading at startup
|
||||
startup_probe {
|
||||
http_get {
|
||||
path = "/health"
|
||||
port = <port>
|
||||
}
|
||||
period_seconds = 10
|
||||
failure_threshold = 30 # up to 5 minutes
|
||||
}
|
||||
```
|
||||
|
||||
The liveness probe checks:
|
||||
- `nvidia-smi` — fails if GPU devices are no longer accessible (CUDA context corruption, device plugin issues)
|
||||
- `curl` health endpoint — fails if the application process is hung
|
||||
|
||||
If either fails 3 times in a row (3 minutes), Kubernetes automatically restarts the pod,
|
||||
which re-acquires the GPU device through the NVIDIA device plugin.
|
||||
|
||||
**Important**: Always pair with a `startup_probe` when using GPU workloads — model loading
|
||||
(TensorRT, ONNX, PyTorch) can take several minutes and would trip a liveness probe
|
||||
configured with a short `initial_delay_seconds`.
|
||||
|
||||
## References
|
||||
|
||||
- [NVIDIA Container Toolkit Troubleshooting](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/troubleshooting.html)
|
||||
- [Kubernetes GPU Device Plugin](https://github.com/NVIDIA/k8s-device-plugin)
|
||||
- [NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/overview.html)
|
||||
|
|
@ -1,113 +0,0 @@
|
|||
---
|
||||
name: k8s-hpa-scaling-storm
|
||||
description: |
|
||||
Fix and prevent HPA (HorizontalPodAutoscaler) scaling storms where pods scale to
|
||||
maxReplicas uncontrollably. Use when: (1) HPA shows memory or CPU utilization at
|
||||
200%+ causing rapid scale-up, (2) dozens or hundreds of pods created by HPA in minutes,
|
||||
(3) cluster becomes unstable due to resource exhaustion from too many pods,
|
||||
(4) etcd timeouts or API server crashes from pod churn, (5) adding resource requests
|
||||
to a deployment that previously had none causes HPA to miscalculate utilization.
|
||||
Covers emergency response and prevention patterns.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-15
|
||||
---
|
||||
|
||||
# Kubernetes HPA Scaling Storm
|
||||
|
||||
## Problem
|
||||
When an HPA is configured with a memory or CPU utilization target but the underlying
|
||||
deployment has insufficient resource requests, the HPA calculates artificially high
|
||||
utilization percentages (e.g., 220% of a 256Mi request when actual usage is 570Mi).
|
||||
This causes the HPA to scale pods to maxReplicas (often 100) within minutes, exhausting
|
||||
cluster resources and potentially crashing etcd and the API server.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- `kubectl get hpa` shows `<unknown>/70%` or very high percentages (200%+)
|
||||
- Pod count for a deployment rapidly increases to maxReplicas
|
||||
- etcd timeout errors in `kubectl` or `terraform apply`
|
||||
- API server becomes unreachable (`connection refused` or `network is unreachable`)
|
||||
- Adding resource requests to a Helm chart that previously had none
|
||||
- Memory-based HPA targets with real usage far exceeding requests
|
||||
|
||||
## Solution
|
||||
|
||||
### Emergency Response (stop the storm)
|
||||
|
||||
**Step 1: Delete the HPA immediately**
|
||||
```bash
|
||||
kubectl --kubeconfig $(pwd)/config delete hpa <hpa-name> -n <namespace>
|
||||
```
|
||||
|
||||
**Step 2: Scale the deployment down**
|
||||
```bash
|
||||
kubectl --kubeconfig $(pwd)/config scale deployment <name> -n <namespace> --replicas=2
|
||||
```
|
||||
|
||||
**Step 3: Wait for pods to terminate and cluster to stabilize**
|
||||
```bash
|
||||
# Watch pod count decrease
|
||||
kubectl --kubeconfig $(pwd)/config get pods -n <namespace> -l <label> | wc -l
|
||||
```
|
||||
|
||||
If the API server is unresponsive, wait 3-5 minutes for it to self-recover. The kubelet
|
||||
will restart static pods (etcd, kube-apiserver) automatically.
|
||||
|
||||
### Prevention
|
||||
|
||||
**Rule 1: Set resource requests to match actual usage**
|
||||
Before enabling HPA, check actual resource consumption:
|
||||
```bash
|
||||
kubectl top pods -n <namespace> -l <label>
|
||||
```
|
||||
Set requests to the baseline (idle) usage, not the minimum possible value.
|
||||
|
||||
**Rule 2: Set reasonable maxReplicas**
|
||||
Never use maxReplicas > 10 unless you've verified the cluster can handle it.
|
||||
Default of 100 is almost never appropriate for a home/small cluster.
|
||||
|
||||
**Rule 3: Prefer CPU-only HPA targets**
|
||||
Memory-based scaling is problematic because:
|
||||
- Memory usage grows over time and rarely decreases
|
||||
- Memory-based scaling creates pods that never scale down
|
||||
- CPU is more responsive to load changes
|
||||
|
||||
**Rule 4: Test HPA changes on a deployment with 0 existing pods first**
|
||||
If adding resource requests to a deployment managed by HPA, temporarily disable
|
||||
the HPA first, set the requests, verify utilization is reasonable, then re-enable.
|
||||
|
||||
## Cascade Effects
|
||||
A scaling storm can cause:
|
||||
1. etcd storage exhaustion (too many pod objects)
|
||||
2. API server OOM or connection limits
|
||||
3. VPN/network connectivity loss (if VPN runs in the cluster)
|
||||
4. Kyverno webhook failures (admission controller overwhelmed)
|
||||
5. Other pods evicted or unable to schedule
|
||||
|
||||
## Verification
|
||||
- `kubectl get hpa -n <namespace>` shows reasonable utilization (< 100%)
|
||||
- Pod count is stable at expected replicas
|
||||
- `kubectl get nodes` responds promptly
|
||||
- No etcd timeout errors
|
||||
|
||||
## Example
|
||||
```bash
|
||||
# Observed: HPA scaling Collabora to 100 pods
|
||||
$ kubectl get hpa -n nextcloud
|
||||
NAME TARGETS MINPODS MAXPODS REPLICAS
|
||||
nextcloud-collabora cpu: 0%/70%, memory: 220%/50% 2 100 83
|
||||
|
||||
# Emergency fix
|
||||
$ kubectl delete hpa nextcloud-collabora -n nextcloud
|
||||
$ kubectl scale deployment nextcloud-collabora -n nextcloud --replicas=2
|
||||
|
||||
# Root cause: 256Mi memory request, actual usage 570Mi
|
||||
# Fix: increase request to 1Gi or disable memory target
|
||||
```
|
||||
|
||||
## Notes
|
||||
- If the HPA is managed by a Helm chart, deleting it via kubectl is temporary—the next
|
||||
Helm upgrade will recreate it. You must also update the Helm values.
|
||||
- In this project, Collabora was ultimately disabled in favor of OnlyOffice to avoid
|
||||
the HPA issue entirely.
|
||||
- See also: `helm-stuck-release-recovery` for fixing Helm releases broken by the storm.
|
||||
|
|
@ -1,235 +0,0 @@
|
|||
---
|
||||
name: k8s-nfs-mount-troubleshooting
|
||||
description: |
|
||||
Debug Kubernetes NFS volume mount failures. Use when: (1) Pod stuck in ContainerCreating
|
||||
for extended time, (2) kubectl describe shows "MountVolume.SetUp failed" with NFS errors,
|
||||
(3) Error message shows "Protocol not supported" or "mount.nfs: access denied",
|
||||
(4) NFS volume defined in pod spec but container won't start, (5) Container starts but
|
||||
gets "Permission denied" writing to NFS volume (non-root container UID mismatch),
|
||||
(6) CronJob or init container fails silently when writing to NFS, (7) Pod shows Running
|
||||
1/1 but service is unresponsive after a node reboot — stale NFS mount causes frozen
|
||||
processes with zero listening sockets. Common root causes are missing NFS export on the
|
||||
server, UID mismatch for non-root containers, and stale mounts after node reboots.
|
||||
author: Claude Code
|
||||
version: 1.2.0
|
||||
date: 2026-02-28
|
||||
---
|
||||
|
||||
# Kubernetes NFS Mount Troubleshooting
|
||||
|
||||
## Problem
|
||||
Pods with NFS volumes get stuck in `ContainerCreating` state indefinitely. The error
|
||||
messages from `kubectl describe pod` can be misleading, showing protocol or permission
|
||||
errors when the actual issue is the NFS export doesn't exist.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Pod status shows `ContainerCreating` for more than 1-2 minutes
|
||||
- `kubectl describe pod` shows events like:
|
||||
- `MountVolume.SetUp failed for volume "data" : mount failed: exit status 32`
|
||||
- `mount.nfs: Protocol not supported`
|
||||
- `mount.nfs: access denied by server`
|
||||
- Pod spec includes an NFS volume mount
|
||||
- Other pods on the same node work fine
|
||||
|
||||
## Solution
|
||||
|
||||
### Step 1: Identify the NFS path
|
||||
```bash
|
||||
kubectl describe pod -n <namespace> <pod-name> | grep -A5 "Volumes:"
|
||||
```
|
||||
Look for the NFS server and path (e.g., `10.0.10.15:/mnt/main/myservice`)
|
||||
|
||||
### Step 2: Verify the export exists on NFS server
|
||||
SSH to the NFS server and check:
|
||||
```bash
|
||||
ssh root@<nfs-server> "ls -la /mnt/main/myservice"
|
||||
```
|
||||
|
||||
### Step 3: If directory doesn't exist, create it
|
||||
```bash
|
||||
ssh root@<nfs-server> "mkdir -p /mnt/main/myservice && chmod 777 /mnt/main/myservice"
|
||||
```
|
||||
|
||||
### Step 4: Add to NFS exports (TrueNAS specific)
|
||||
For TrueNAS, add the path to the NFS share configuration:
|
||||
1. Add directory to `scripts/nfs_directories.txt`
|
||||
2. Run `scripts/nfs_exports.sh` to update the share via API
|
||||
|
||||
### Step 5: Restart the pod
|
||||
```bash
|
||||
kubectl delete pod -n <namespace> -l app=<app-label>
|
||||
```
|
||||
The deployment will create a new pod that should now mount successfully.
|
||||
|
||||
## Verification
|
||||
```bash
|
||||
kubectl get pods -n <namespace>
|
||||
# Should show 1/1 Running instead of 0/1 ContainerCreating
|
||||
|
||||
kubectl exec -n <namespace> <pod-name> -- ls -la /app/data
|
||||
# Should show the mounted directory contents
|
||||
```
|
||||
|
||||
## Example
|
||||
**Symptom:**
|
||||
```
|
||||
Events:
|
||||
Warning FailedMount 55s (x13 over 11m) kubelet MountVolume.SetUp failed for volume "data" : mount failed: exit status 32
|
||||
Mounting command: mount
|
||||
Mounting arguments: -t nfs 10.0.10.15:/mnt/main/resume /var/lib/kubelet/pods/.../data
|
||||
Output: mount.nfs: Protocol not supported
|
||||
```
|
||||
|
||||
**Root Cause:** The directory `/mnt/main/resume` didn't exist on the TrueNAS server.
|
||||
|
||||
**Fix:**
|
||||
```bash
|
||||
ssh root@10.0.10.15 'mkdir -p /mnt/main/resume && chmod 777 /mnt/main/resume'
|
||||
# Then add to NFS exports and restart pod
|
||||
```
|
||||
|
||||
## Notes
|
||||
- The "Protocol not supported" error is misleading - it often means the export path doesn't exist
|
||||
- Always check the NFS server first before investigating protocol/firewall issues
|
||||
- For TrueNAS, the NFS share must be updated via API/UI after creating new directories
|
||||
- NFSv3 vs NFSv4 issues are rare in modern setups; missing paths are more common
|
||||
- Check that the NFS client packages are installed on Kubernetes nodes if this is a new cluster
|
||||
|
||||
## Variant: Non-Root Container UID Permission Denied
|
||||
|
||||
### Problem
|
||||
Container starts and mounts NFS successfully, but gets "Permission denied" when
|
||||
writing files. The pod appears healthy but operations fail silently.
|
||||
|
||||
### Trigger Conditions
|
||||
- Container logs show "Permission denied" or "client returned ERROR on write"
|
||||
- Pod is Running (not stuck in ContainerCreating)
|
||||
- NFS directory exists and is mounted, but owned by root (uid 0)
|
||||
- Container image runs as a non-root user (e.g., `curlimages/curl` runs as uid 101)
|
||||
- CronJobs or init containers that write to NFS fail with no obvious error
|
||||
|
||||
### Common Non-Root Container UIDs
|
||||
| Image | UID | User |
|
||||
|-------|-----|------|
|
||||
| `curlimages/curl` | 101 | curl_user |
|
||||
| `nginx` (unprivileged) | 101 | nginx |
|
||||
| `node` | 1000 | node |
|
||||
| `python` (slim) | 0 | root (safe) |
|
||||
| `grafana/grafana` | 472 | grafana |
|
||||
|
||||
### Solution
|
||||
Fix permissions on the NFS server:
|
||||
```bash
|
||||
# Option 1: World-writable (simplest, suitable for non-sensitive data)
|
||||
ssh root@10.0.10.15 "chmod -R 777 /mnt/main/<service>/<subdir>"
|
||||
|
||||
# Option 2: Match container UID (more secure)
|
||||
ssh root@10.0.10.15 "chown -R <uid>:<gid> /mnt/main/<service>/<subdir>"
|
||||
|
||||
# Option 3: Use securityContext in pod spec to run as root
|
||||
spec:
|
||||
securityContext:
|
||||
runAsUser: 0
|
||||
```
|
||||
|
||||
### Debugging
|
||||
```bash
|
||||
# Check what UID the container runs as
|
||||
kubectl exec -n <namespace> <pod> -- id
|
||||
|
||||
# Test write access from inside container
|
||||
kubectl exec -n <namespace> <pod> -- sh -c 'echo test > /path/to/nfs/testfile'
|
||||
|
||||
# Check NFS directory ownership on server
|
||||
ssh root@10.0.10.15 "ls -la /mnt/main/<service>/"
|
||||
```
|
||||
|
||||
## Variant: Stale NFS Mounts After Node Reboot (Ghost Running Pods)
|
||||
|
||||
### Problem
|
||||
After a node reboot (e.g., from kured rolling kernel updates), pods are rescheduled and
|
||||
show `Running 1/1` status, but the application process is frozen/hung. The service is
|
||||
completely unresponsive despite appearing healthy to Kubernetes.
|
||||
|
||||
### Trigger Conditions
|
||||
- Node was recently rebooted (check `kubectl get nodes` for age, or kured logs)
|
||||
- Pod shows `Running 1/1` with 0 restarts (looks perfectly healthy)
|
||||
- Service is unresponsive — Uptime Kuma or curl shows timeout/connection refused
|
||||
- `kubectl exec <pod> -- ss -tlnp` shows **zero listening sockets** (the process started but is hung)
|
||||
- Pod uses NFS volumes (inline `nfs {}` or PVC backed by NFS)
|
||||
- Multiple pods across different namespaces all exhibit the same symptom simultaneously
|
||||
- `kubectl describe pod` shows no warnings or errors — everything looks normal
|
||||
|
||||
### Root Cause
|
||||
When a node reboots, the NFS client mounts go stale. If the pod is rescheduled to the
|
||||
same or different node before NFS fully recovers, the application process starts but
|
||||
immediately hangs when it tries to access the NFS-mounted filesystem. The process is
|
||||
stuck in an uninterruptible I/O wait (D state) but Kubernetes sees the container as
|
||||
running because the PID exists and liveness probes (if any) may not exercise the NFS path.
|
||||
|
||||
### Solution
|
||||
Force-delete the affected pods to trigger a clean reschedule with fresh NFS mounts:
|
||||
|
||||
```bash
|
||||
# Identify hung pods — Running but no listening sockets
|
||||
kubectl exec -n <namespace> <pod> -- ss -tlnp 2>/dev/null
|
||||
# If output is empty or shows no expected ports, the pod is hung
|
||||
|
||||
# Force-delete to skip graceful shutdown (hung process won't respond to SIGTERM)
|
||||
kubectl delete pod -n <namespace> <pod> --force --grace-period=0
|
||||
|
||||
# The deployment controller creates a new pod with fresh NFS mounts
|
||||
kubectl get pods -n <namespace> -w
|
||||
```
|
||||
|
||||
For bulk remediation after a cluster-wide event:
|
||||
```bash
|
||||
# Find all pods with NFS volumes that might be hung
|
||||
# Check each service's expected port — if ss -tlnp shows nothing, force-delete
|
||||
for ns in calibre stirling-pdf send speedtest n8n paperless-ngx; do
|
||||
pod=$(kubectl get pod -n $ns -o name | head -1)
|
||||
sockets=$(kubectl exec -n $ns ${pod} -- ss -tlnp 2>/dev/null | wc -l)
|
||||
if [ "$sockets" -le 1 ]; then
|
||||
echo "HUNG: $ns/$pod (no listening sockets)"
|
||||
kubectl delete ${pod} -n $ns --force --grace-period=0
|
||||
fi
|
||||
done
|
||||
```
|
||||
|
||||
### Verification
|
||||
```bash
|
||||
# New pod should have listening sockets
|
||||
kubectl exec -n <namespace> <new-pod> -- ss -tlnp
|
||||
# Should show the application's expected port (e.g., *:8080)
|
||||
|
||||
# Service should respond
|
||||
kubectl exec -n <namespace> <new-pod> -- curl -sI http://localhost:<port>/
|
||||
# Should return HTTP response
|
||||
```
|
||||
|
||||
### Key Diagnostic Insight
|
||||
The critical signal is **Running 1/1 but zero listening sockets**. Normal healthy pods
|
||||
always have at least one listening socket for their application port. If `ss -tlnp`
|
||||
returns nothing, the process is hung on a stale NFS mount, not crashed — that's why
|
||||
Kubernetes thinks it's fine.
|
||||
|
||||
### Prevention
|
||||
- Add **liveness probes** that hit the application's HTTP endpoint (not just TCP connect):
|
||||
```hcl
|
||||
liveness_probe {
|
||||
http_get {
|
||||
path = "/"
|
||||
port = 8080
|
||||
}
|
||||
initial_delay_seconds = 60
|
||||
period_seconds = 30
|
||||
timeout_seconds = 5
|
||||
}
|
||||
```
|
||||
- This ensures Kubernetes detects hung pods and restarts them automatically.
|
||||
|
||||
## See Also
|
||||
- **nfsv4-idmapd-uid-mapping** — All UIDs show as 65534 (nobody) inside containers. Different from permission denied; the UIDs are wrong, not the permissions.
|
||||
- TrueNAS NFS configuration documentation
|
||||
- Kubernetes NFS volume documentation
|
||||
- k8s-limitrange-oom-silent-kill (for OOM issues often confused with NFS hangs)
|
||||
|
|
@ -1,109 +0,0 @@
|
|||
---
|
||||
name: kubelet-static-pod-manifest-update
|
||||
description: |
|
||||
Force kubelet to pick up changes to static pod manifests in /etc/kubernetes/manifests/.
|
||||
Use when: (1) edited kube-apiserver.yaml but the running process still has old flags,
|
||||
(2) kubelet restart doesn't pick up manifest changes, (3) touching the manifest file
|
||||
doesn't trigger pod recreation, (4) killing the API server process results in the
|
||||
same old args on restart, (5) the pod's config.hash annotation doesn't match the
|
||||
file's hash. Requires a full cycle: remove manifest, stop kubelet, remove containers,
|
||||
re-add manifest, start kubelet.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-17
|
||||
---
|
||||
|
||||
# Kubelet Static Pod Manifest Update
|
||||
|
||||
## Problem
|
||||
After editing a static pod manifest (e.g., `/etc/kubernetes/manifests/kube-apiserver.yaml`
|
||||
to add OIDC or audit flags), kubelet continues running the pod with the old configuration.
|
||||
Standard approaches like `touch`, `systemctl restart kubelet`, or `kubectl delete pod`
|
||||
do not force kubelet to reconcile the new manifest.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Edited `/etc/kubernetes/manifests/kube-apiserver.yaml` (or other static pod manifests)
|
||||
- The running process (`ps aux | grep kube-apiserver`) shows old flags
|
||||
- `kubectl get pod -n kube-system kube-apiserver-* -o jsonpath='{.metadata.annotations.kubernetes\.io/config\.hash}'` returns a stale hash
|
||||
- Any of these actions failed to apply the changes:
|
||||
- `touch /etc/kubernetes/manifests/kube-apiserver.yaml`
|
||||
- `systemctl restart kubelet`
|
||||
- `kubectl delete pod kube-apiserver-*`
|
||||
- Killing the API server process directly
|
||||
|
||||
## Root Cause
|
||||
Kubelet maintains an internal cache of static pod specs keyed by a hash of the manifest.
|
||||
When the manifest changes, kubelet should detect the new hash and recreate the pod.
|
||||
However, in practice (observed on Kubernetes 1.34.x), kubelet can get stuck with the
|
||||
old hash if:
|
||||
- The pod's mirror object in the API server still exists with the old hash
|
||||
- Kubelet's internal pod cache wasn't cleared between restarts
|
||||
- The container runtime (containerd) still has the old container running
|
||||
|
||||
## Solution
|
||||
|
||||
Full restart cycle on the master node:
|
||||
|
||||
```bash
|
||||
# 1. Back up the manifest
|
||||
sudo cp /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/kube-apiserver.yaml.bak
|
||||
|
||||
# 2. Remove the manifest (kubelet will stop the pod)
|
||||
sudo rm /etc/kubernetes/manifests/kube-apiserver.yaml
|
||||
|
||||
# 3. Stop kubelet
|
||||
sudo systemctl stop kubelet
|
||||
|
||||
# 4. Wait for the API server container to stop
|
||||
sleep 5
|
||||
|
||||
# 5. Force-remove any remaining API server containers
|
||||
sudo crictl rm -f $(sudo crictl ps -aq --name kube-apiserver 2>/dev/null) 2>/dev/null
|
||||
|
||||
# 6. Re-add the manifest (with your changes)
|
||||
sudo cp /tmp/kube-apiserver.yaml.bak /etc/kubernetes/manifests/kube-apiserver.yaml
|
||||
|
||||
# 7. Start kubelet
|
||||
sudo systemctl start kubelet
|
||||
|
||||
# 8. Wait for API server to come up (30-60 seconds)
|
||||
sleep 45
|
||||
|
||||
# 9. Verify new flags are active
|
||||
sudo cat /proc/$(pgrep -f 'kube-apiserver --' | head -1)/cmdline | tr '\0' '\n' | grep 'your-new-flag'
|
||||
```
|
||||
|
||||
**Critical:** The order matters. Removing the manifest BEFORE stopping kubelet ensures
|
||||
kubelet processes the removal. Then clearing containers ensures no stale state. Finally,
|
||||
re-adding the manifest with kubelet running triggers a fresh pod creation.
|
||||
|
||||
## What Does NOT Work
|
||||
|
||||
| Approach | Why it fails |
|
||||
|----------|-------------|
|
||||
| `touch manifest.yaml` | Kubelet may not detect mtime-only changes |
|
||||
| `systemctl restart kubelet` | Kubelet reuses cached pod spec if hash matches |
|
||||
| `kubectl delete pod` | Deletes mirror pod but kubelet recreates from cached spec |
|
||||
| `kill <apiserver-pid>` | Container runtime restarts the same container with old args |
|
||||
| Moving manifest away and back without stopping kubelet | Kubelet may cache the old spec in memory |
|
||||
|
||||
## Verification
|
||||
|
||||
```bash
|
||||
# Check the running process has new flags
|
||||
ps aux | grep kube-apiserver | grep -v grep | grep 'your-new-flag'
|
||||
|
||||
# Check the config hash changed
|
||||
kubectl get pod -n kube-system kube-apiserver-$(hostname) \
|
||||
-o jsonpath='{.metadata.annotations.kubernetes\.io/config\.hash}'
|
||||
|
||||
# Check API server logs for successful startup
|
||||
kubectl logs -n kube-system kube-apiserver-$(hostname) | tail -5
|
||||
```
|
||||
|
||||
## Notes
|
||||
- This applies to ALL static pods, not just kube-apiserver (etcd, controller-manager, scheduler)
|
||||
- The cluster will be briefly unavailable during the restart (30-60 seconds)
|
||||
- On single-master clusters, kubectl commands will fail during the restart — use `sudo kubectl --kubeconfig=/etc/kubernetes/admin.conf` from the master
|
||||
- Always validate the YAML before removing the manifest: `python3 -c "import yaml; yaml.safe_load(open('/etc/kubernetes/manifests/kube-apiserver.yaml'))"`
|
||||
- See also: `authentik-oidc-kubernetes` skill for the full OIDC setup context
|
||||
|
|
@ -1,143 +0,0 @@
|
|||
---
|
||||
name: local-llm-gpu-selection
|
||||
description: |
|
||||
Guide for selecting GPUs and hardware for local LLM inference on Dell R730 and
|
||||
comparing to Apple Silicon alternatives. Use when: (1) user asks about running
|
||||
local models (Ollama, llama.cpp), (2) user asks which GPU to buy for LLMs,
|
||||
(3) user wants to compare local models to Claude for coding, (4) user asks about
|
||||
quantized model selection, (5) user asks about Mac Mini/Studio vs GPU server for
|
||||
LLMs. Covers VRAM requirements, memory bandwidth as key metric, R730 GPU compatibility,
|
||||
multi-GPU considerations, and realistic quality comparisons to Claude models.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2025-06-11
|
||||
---
|
||||
|
||||
# Local LLM GPU Selection & Performance Guide
|
||||
|
||||
## Problem
|
||||
Choosing the right hardware for local LLM inference requires understanding the
|
||||
relationship between VRAM capacity, memory bandwidth, GPU compatibility with
|
||||
server chassis, and realistic model quality expectations.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- User asks about running quantized models locally (Ollama, llama.cpp)
|
||||
- User wants to know which GPU fits their server (Dell R730 or similar 2U)
|
||||
- User asks about Apple Silicon (Mac Mini/Studio) vs datacenter GPUs for LLMs
|
||||
- User wants to compare local model quality to Claude (Opus/Sonnet/Haiku) for coding
|
||||
|
||||
## Key Principle: Memory Bandwidth Is Everything
|
||||
|
||||
LLM token generation is **memory-bandwidth bound**, not compute bound. The formula:
|
||||
```
|
||||
approx tokens/sec = memory_bandwidth_GB_s / model_size_GB
|
||||
```
|
||||
This is why Apple Silicon (high bandwidth unified memory) competes with datacenter GPUs
|
||||
despite having less raw compute.
|
||||
|
||||
## VRAM Requirements by Model Size
|
||||
|
||||
| Model Size | Quant | VRAM Needed | Examples |
|
||||
|------------|-------|-------------|----------|
|
||||
| 7-8B | Q4_K_M | ~5 GB | Llama 3.1 8B, Mistral 7B |
|
||||
| 7-8B | Q8_0 | ~8 GB | |
|
||||
| 13-14B | Q4_K_M | ~8 GB | Qwen 2.5 Coder 14B |
|
||||
| 22-24B | Q4_K_M | ~13-14 GB | Mistral Small, Codestral |
|
||||
| 32B | Q4_K_M | ~20 GB | Qwen 2.5 Coder 32B |
|
||||
| 32B | Q8_0 | ~34 GB | |
|
||||
| 70B | Q4_K_M | ~40 GB | Llama 3.1 70B |
|
||||
| 70B | Q8_0 | ~70 GB | |
|
||||
|
||||
Add ~1-2 GB overhead for KV cache and context. Longer conversations use more.
|
||||
|
||||
## Dell R730 GPU Compatibility
|
||||
|
||||
### Constraints
|
||||
- **2U chassis**: Full-height cards fit, but limited to dual-slot width
|
||||
- **PCIe 3.0 x16 slots**: 2-3 usable slots depending on riser configuration
|
||||
- **Power**: Needs Dell GPU power cable (P/N 0D4J0T) for GPUs >75W TDP
|
||||
- **PSU**: Check wattage headroom (dual 750W or 1100W typical)
|
||||
|
||||
### Compatible GPUs
|
||||
|
||||
**No external power needed (<=75W):**
|
||||
- Tesla T4: 16 GB, 320 GB/s, 70W — best drop-in option
|
||||
- Tesla P4: 8 GB, 192 GB/s, 75W — too little VRAM for modern LLMs
|
||||
- NVIDIA L4: 24 GB, 300 GB/s, 72W — T4 successor, Ada Lovelace, expensive
|
||||
- NVIDIA A2: 16 GB, 200 GB/s, 60W — worse than T4 in every way, avoid
|
||||
|
||||
**Requires power cable (>75W):**
|
||||
- Tesla P40: 24 GB, 346 GB/s, 250W — best value per GB
|
||||
- Tesla V100 PCIe: 32 GB, 900 GB/s, 250W — excellent bandwidth
|
||||
- Tesla P100 PCIe: 16 GB, 732 GB/s, 250W — same VRAM as T4, not worth it
|
||||
|
||||
**Won't fit:**
|
||||
- RTX 3090/4090: Too thick (3-slot), too long
|
||||
- A100: Fits physically but very expensive
|
||||
- Any consumer RTX: Generally too large for 2U
|
||||
|
||||
### Multi-GPU Considerations
|
||||
- Ollama splits model layers across GPUs automatically
|
||||
- PCIe 3.0 cross-GPU transfer adds ~30-40% latency penalty
|
||||
- Mismatched GPUs (e.g., T4 + P40) work but the slower card bottlenecks
|
||||
- R730 PCIe 3.0 limits newer GPU bandwidth (L4 runs at half its rated speed)
|
||||
|
||||
## Apple Silicon Comparison
|
||||
|
||||
Apple Silicon unified memory means ALL system RAM = VRAM with no bus penalty.
|
||||
|
||||
| Device | Memory | Bandwidth | Advantage |
|
||||
|--------|--------|-----------|-----------|
|
||||
| Mac Mini M4 Pro 48 GB | 48 GB | 273 GB/s | Silent, 25W, no PCIe penalty |
|
||||
| Mac Studio M4 Max 128 GB | 128 GB | 546 GB/s | Run 100B+ models |
|
||||
| Mac Studio M4 Ultra 192 GB | 192 GB | 819 GB/s | Run anything |
|
||||
|
||||
A Mac Mini M4 Pro 48GB often matches or beats a T4+L4 multi-GPU setup for
|
||||
LLM inference due to zero cross-GPU overhead and high unified bandwidth.
|
||||
|
||||
## Best Coding Models (for Ollama)
|
||||
|
||||
For coding tasks specifically, prefer dedicated coding models:
|
||||
1. **Qwen 2.5 Coder 32B** — best open-source coding model in this size class
|
||||
2. **Codestral 22B** — Mistral's dedicated coding model
|
||||
3. **DeepSeek Coder V2** — good quality, efficient
|
||||
4. **Llama 3.1 70B** — strong general purpose but needs ~40 GB
|
||||
|
||||
## Realistic Quality Comparison to Claude
|
||||
|
||||
For Claude Code-style agentic coding workflows:
|
||||
|
||||
| Capability | Opus/Sonnet | Haiku | Qwen 2.5 Coder 32B | 70B General |
|
||||
|-----------|-------------|-------|---------------------|-------------|
|
||||
| Single function gen | Excellent | Good | Good | Decent |
|
||||
| Multi-file refactoring | Excellent | Decent | Weak | Weak |
|
||||
| Tool use / agentic loops | Excellent | Good | Poor | Poor |
|
||||
| Long context (large codebases) | Excellent | Good | Weak | Weak |
|
||||
|
||||
Local models work for simple completions and code questions. They struggle badly
|
||||
with Claude Code's complex multi-step tool-use workflows, long context windows,
|
||||
and self-correction capabilities.
|
||||
|
||||
## Quantization Quality Guide
|
||||
|
||||
From best to worst quality (and largest to smallest):
|
||||
- FP16: Full precision, baseline quality
|
||||
- Q8_0: Near-lossless, ~50% size reduction
|
||||
- Q6_K: Minimal quality loss
|
||||
- Q5_K_M: Good balance
|
||||
- Q4_K_M: **Recommended default** — best quality/size tradeoff
|
||||
- Q3_K_M: Noticeable degradation on complex reasoning
|
||||
- Q2_K: Significant quality loss, emergency only
|
||||
|
||||
## Verification
|
||||
- Check GPU compatibility: `lspci | grep -i nvidia` on the host
|
||||
- Check available VRAM: `nvidia-smi` inside the GPU VM
|
||||
- Check model fit: Ollama shows VRAM usage during `ollama run`
|
||||
- Check inference speed: Count tokens/sec in Ollama output
|
||||
|
||||
## Notes
|
||||
- GPU prices fluctuate significantly in the used market; check current prices
|
||||
- The T4 is PCIe 3.0 only; newer GPUs in PCIe 3.0 slots run at reduced bandwidth
|
||||
- Power consumption matters for 24/7 homelab use (electricity cost)
|
||||
- For Claude Code specifically, API-based Claude models remain significantly
|
||||
superior to any local model for agentic coding workflows
|
||||
|
|
@ -1,143 +0,0 @@
|
|||
---
|
||||
name: loki-helm-deployment-pitfalls
|
||||
description: |
|
||||
Fix common Loki Helm chart deployment failures on Kubernetes with Terraform.
|
||||
Use when: (1) Loki pod fails with "mkdir: read-only file system" for compactor
|
||||
or ruler paths, (2) Helm chart fails with "Helm test requires the Loki Canary
|
||||
to be enabled", (3) Helm install fails with "cannot re-use a name that is still
|
||||
in use" after a failed atomic deploy, (4) PV stuck in Released state after failed
|
||||
Helm install, (5) "entry too far behind" errors flooding Loki logs after initial
|
||||
Alloy deployment. Covers single-binary mode with filesystem storage on NFS.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-13
|
||||
---
|
||||
|
||||
# Loki Helm Chart Deployment Pitfalls
|
||||
|
||||
## Problem
|
||||
Deploying the Grafana Loki Helm chart in single-binary mode with Terraform hits
|
||||
multiple non-obvious failures that aren't documented together.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Deploying Loki via `helm_release` in Terraform
|
||||
- Using `deploymentMode: SingleBinary` with filesystem storage on NFS
|
||||
- First-time deployment or redeployment after failures
|
||||
|
||||
## Pitfall 1: Read-Only Root Filesystem
|
||||
|
||||
**Error:** `mkdir /loki/compactor: read-only file system`
|
||||
|
||||
**Cause:** The Loki Helm chart runs containers with a read-only root filesystem
|
||||
for security. The compactor `working_directory` and ruler `rule_path` default to
|
||||
paths under `/loki/` which is on the read-only root FS.
|
||||
|
||||
**Fix:** Use paths under `/var/loki/` — the Helm chart mounts the persistence
|
||||
volume there:
|
||||
```yaml
|
||||
compactor:
|
||||
working_directory: /var/loki/compactor # NOT /loki/compactor
|
||||
ruler:
|
||||
rule_path: /var/loki/scratch # NOT /loki/scratch
|
||||
```
|
||||
|
||||
## Pitfall 2: Canary Required
|
||||
|
||||
**Error:** `Helm test requires the Loki Canary to be enabled`
|
||||
|
||||
**Cause:** The Loki Helm chart's validation template requires `lokiCanary.enabled`
|
||||
to be true. You cannot disable it.
|
||||
|
||||
**Fix:** Leave `lokiCanary` enabled (default). You can disable `gateway`,
|
||||
`chunksCache`, and `resultsCache` to reduce resource usage:
|
||||
```yaml
|
||||
gateway:
|
||||
enabled: false
|
||||
chunksCache:
|
||||
enabled: false
|
||||
resultsCache:
|
||||
enabled: false
|
||||
# Do NOT add: lokiCanary: enabled: false
|
||||
```
|
||||
|
||||
## Pitfall 3: Stale Helm Release After Failed Atomic Deploy
|
||||
|
||||
**Error:** `cannot re-use a name that is still in use`
|
||||
|
||||
**Cause:** When `atomic = true` and the deploy fails, Helm rolls back but
|
||||
sometimes leaves a stale release secret in Kubernetes. Terraform then can't
|
||||
create a new release with the same name.
|
||||
|
||||
**Fix:** Delete the stale Helm secret:
|
||||
```bash
|
||||
kubectl delete secret -n monitoring sh.helm.release.v1.loki.v1
|
||||
```
|
||||
Also consider removing `atomic = true` for initial deployments and adding it
|
||||
back after the first successful install. Use a longer `timeout` (600s+) for
|
||||
first deploy since image pulls take time.
|
||||
|
||||
## Pitfall 4: PV Stuck in Released State
|
||||
|
||||
**Symptom:** PV shows `Released` status, PVC can't bind, Loki pod stuck in Pending.
|
||||
|
||||
**Cause:** After a failed Helm deploy, the PVC is deleted but the PV retains a
|
||||
`claimRef` to the old PVC. New PVCs can't bind to a `Released` PV.
|
||||
|
||||
**Fix:** Clear the stale claimRef:
|
||||
```bash
|
||||
kubectl patch pv loki --type json -p '[{"op": "remove", "path": "/spec/claimRef"}]'
|
||||
```
|
||||
The PV will transition from `Released` to `Available` and can be bound again.
|
||||
|
||||
## Pitfall 5: "Entry Too Far Behind" Log Spam
|
||||
|
||||
**Error:** `entry too far behind, entry timestamp is: ... oldest acceptable timestamp is: ...`
|
||||
|
||||
**Cause:** Alloy reads all historical log files from the Kubernetes API on first
|
||||
startup. Old entries are rejected by Loki's ingester because they're behind the
|
||||
newest entry for that stream.
|
||||
|
||||
**Fix:** This is harmless and self-resolving — Alloy catches up to present time
|
||||
and errors stop. To clear immediately:
|
||||
```bash
|
||||
kubectl rollout restart ds -n monitoring alloy
|
||||
```
|
||||
After restart, Alloy tails from approximately "now" for each container.
|
||||
|
||||
## Pitfall 6: Alertmanager Service Name
|
||||
|
||||
**Symptom:** Loki ruler alerts never fire despite correct LogQL rules.
|
||||
|
||||
**Cause:** The Prometheus Helm chart names the Alertmanager service
|
||||
`prometheus-alertmanager`, not `alertmanager`. Using the wrong name causes
|
||||
silent alert delivery failures.
|
||||
|
||||
**Fix:**
|
||||
```yaml
|
||||
ruler:
|
||||
alertmanager_url: http://prometheus-alertmanager.monitoring.svc.cluster.local:9093
|
||||
```
|
||||
Verify the actual service name: `kubectl get svc -n monitoring | grep alertmanager`
|
||||
|
||||
## Verification
|
||||
```bash
|
||||
# Loki pod running
|
||||
kubectl get pods -n monitoring -l app.kubernetes.io/name=loki
|
||||
|
||||
# Loki receiving logs
|
||||
kubectl port-forward -n monitoring svc/loki 3100:3100 &
|
||||
curl -s 'http://localhost:3100/loki/api/v1/labels'
|
||||
# Should return JSON with namespace, pod, container labels
|
||||
|
||||
# PV bound
|
||||
kubectl get pv loki
|
||||
# STATUS should be "Bound"
|
||||
```
|
||||
|
||||
## Notes
|
||||
- Always check PV status before retrying a failed deploy
|
||||
- The Loki Helm chart creates many components by default (gateway, canary,
|
||||
memcached caches) — disable what you don't need for single-binary mode
|
||||
- WAL directory can be on tmpfs (emptyDir with `medium: Memory`) for
|
||||
disk-friendly setups, but data is lost on pod crash
|
||||
- See also: `helm-release-force-rerender` for Helm values not updating resources
|
||||
|
|
@ -1,148 +0,0 @@
|
|||
---
|
||||
name: music-assistant-librespot-wrong-account
|
||||
description: |
|
||||
Fix for Music Assistant Spotify playback failing with "librespot does not support free
|
||||
accounts" even when the Spotify account has Premium. Use when: (1) Songs load for 1-2
|
||||
seconds then auto-pause, (2) Music Assistant logs show "librespot does not support free
|
||||
accounts" followed by FFmpeg "Invalid data found when processing input" exit code 183,
|
||||
(3) Spotify provider shows "Successfully logged in" but streaming fails. Root cause is
|
||||
stale librespot credential cache pointing to a different (free-tier) Spotify account.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-21
|
||||
---
|
||||
|
||||
# Music Assistant Librespot Wrong Account / Stale Credentials
|
||||
|
||||
## Problem
|
||||
Music Assistant (MASS) Spotify playback fails immediately — songs appear to load for 1-2
|
||||
seconds then auto-pause. Every track is marked "unplayable". The error log shows librespot
|
||||
rejecting the account as "free" despite the configured Spotify account having Premium.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Music Assistant addon on Home Assistant (tested with v2.7.8, addon `d5369777_music_assistant`)
|
||||
- Symptoms: Song starts loading, pauses after 1-2 seconds, skipped as "unplayable"
|
||||
- Log pattern (all three appear together on every play attempt):
|
||||
```
|
||||
WARNING [music_assistant.spotify] [librespot] librespot does not support "free" accounts.
|
||||
WARNING [music_assistant.audio.media_stream] Error opening input: Invalid data found when processing input
|
||||
ERROR [music_assistant.streams] AudioError while streaming queue item ... FFMpeg exited with code 183
|
||||
```
|
||||
- OAuth login succeeds: `Successfully logged in to Spotify as <Name>`
|
||||
- But librespot streaming fails with the "free" account error
|
||||
|
||||
## Root Cause
|
||||
Music Assistant uses **two separate auth mechanisms** for Spotify:
|
||||
1. **OAuth (PKCE flow)** — for browsing, search, metadata. Uses access tokens refreshed via
|
||||
the Spotify Web API. This is what produces the "Successfully logged in" message.
|
||||
2. **Librespot** — for actual audio streaming. Uses cached credentials stored in
|
||||
`/data/.cache/spotify--<id>/credentials.json` inside the addon container.
|
||||
|
||||
The librespot credential cache can become stale or point to a **different Spotify account**
|
||||
(e.g., if another family member logged in, or credentials were cached from before a Premium
|
||||
upgrade). Librespot uses these cached credentials to connect to Spotify's internal API, which
|
||||
returns a `ProductInfo` XML packet containing the account `type`. If the cached account is
|
||||
"free", librespot calls `exit(1)`, killing the audio pipeline before FFmpeg receives any data.
|
||||
|
||||
## How Librespot Determines Account Type
|
||||
Librespot reads the `type` field from Spotify's `ProductInfo` server packet
|
||||
(`librespot-org/librespot`, `core/src/session.rs`):
|
||||
```rust
|
||||
fn check_catalogue(attributes: &UserAttributes) {
|
||||
if let Some(account_type) = attributes.get("type") {
|
||||
if account_type != "premium" {
|
||||
error!("librespot does not support {account_type:?} accounts.");
|
||||
exit(1);
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
The check is an exact string match against `"premium"`.
|
||||
|
||||
## Solution
|
||||
|
||||
### Step 1: Verify the Problem
|
||||
Check Music Assistant addon logs for the "free accounts" error:
|
||||
```bash
|
||||
# Via HA API (from a machine with the HA token)
|
||||
python3 -c "
|
||||
import os, json, requests
|
||||
url = os.environ.get('HOME_ASSISTANT_SOFIA_URL', '').rstrip('/')
|
||||
token = os.environ.get('HOME_ASSISTANT_SOFIA_TOKEN', '')
|
||||
headers = {'Authorization': f'Bearer {token}'}
|
||||
r = requests.get(f'{url}/api/hassio/addons/d5369777_music_assistant/logs', headers=headers)
|
||||
for line in r.text.split('\n'):
|
||||
if 'free' in line.lower() or 'librespot' in line.lower():
|
||||
print(line)
|
||||
"
|
||||
```
|
||||
|
||||
### Step 2: Identify the Music Assistant Container
|
||||
From the SSH addon (ha-sofia: `ssh vbarzin@192.168.1.8`):
|
||||
```bash
|
||||
sudo curl -s --unix-socket /run/docker.sock http://localhost/containers/json | \
|
||||
python3 -c "import sys,json; [print(c['Names'][0], c['Id'][:12]) for c in json.load(sys.stdin) if 'music' in c['Names'][0].lower()]"
|
||||
```
|
||||
|
||||
### Step 3: Check Cached Credentials
|
||||
Exec into the container to read the librespot cache:
|
||||
```bash
|
||||
# Create exec
|
||||
EXEC_ID=$(sudo curl -s --unix-socket /run/docker.sock \
|
||||
"http://localhost/containers/<CONTAINER_ID>/exec" \
|
||||
-H 'Content-Type: application/json' \
|
||||
-d '{"Cmd":["cat","/data/.cache/spotify--5s3mSP8y/credentials.json"],"AttachStdout":true,"AttachStderr":true}' | python3 -c "import sys,json; print(json.load(sys.stdin)['Id'])")
|
||||
|
||||
# Run exec
|
||||
sudo curl -s --unix-socket /run/docker.sock \
|
||||
"http://localhost/exec/$EXEC_ID/start" \
|
||||
-H 'Content-Type: application/json' -d '{"Detach":false}'
|
||||
```
|
||||
Check the `username` field — if it doesn't match the expected Premium account, that's the problem.
|
||||
|
||||
### Step 4: Clear the Cache
|
||||
```bash
|
||||
# Create exec to delete cache
|
||||
EXEC_ID=$(sudo curl -s --unix-socket /run/docker.sock \
|
||||
"http://localhost/containers/<CONTAINER_ID>/exec" \
|
||||
-H 'Content-Type: application/json' \
|
||||
-d '{"Cmd":["rm","-rf","/data/.cache/spotify--5s3mSP8y"],"AttachStdout":true,"AttachStderr":true}' | python3 -c "import sys,json; print(json.load(sys.stdin)['Id'])")
|
||||
|
||||
# Run exec
|
||||
sudo curl -s --unix-socket /run/docker.sock \
|
||||
"http://localhost/exec/$EXEC_ID/start" \
|
||||
-H 'Content-Type: application/json' -d '{"Detach":false}'
|
||||
```
|
||||
|
||||
### Step 5: Restart Music Assistant
|
||||
```bash
|
||||
sudo curl -s --unix-socket /run/docker.sock \
|
||||
"http://localhost/containers/<CONTAINER_ID>/restart" -X POST
|
||||
```
|
||||
|
||||
### Step 6: Verify
|
||||
After restart, check logs for:
|
||||
- `Successfully logged in to Spotify as <Name>` (OAuth OK)
|
||||
- No "free accounts" error when playing a track
|
||||
- Optionally re-check `/data/.cache/spotify--5s3mSP8y/credentials.json` to confirm the
|
||||
`username` now matches the Premium account
|
||||
|
||||
## Verification
|
||||
1. Play any Spotify track through Music Assistant
|
||||
2. The track should stream without pausing after 1-2 seconds
|
||||
3. Logs should show `Start Queue Flow stream` without subsequent `AudioError`
|
||||
|
||||
## Notes
|
||||
- The cache directory name `spotify--5s3mSP8y` is an internal Music Assistant provider ID
|
||||
and may differ across installations. Use `find /data -name credentials.json` to locate it.
|
||||
- The `username` field in the credentials cache is Spotify's internal user ID (numeric for
|
||||
newer accounts, text for older ones), not necessarily the display name or email.
|
||||
- Spotify Family plan **owners** have account type `"premium"`. Family plan **members** also
|
||||
report as `"premium"` when their membership is active.
|
||||
- If the problem recurs, it may indicate that Music Assistant's Spotify provider re-caches
|
||||
the wrong credentials — check if multiple Spotify accounts are configured or if another
|
||||
user logged in via the Music Assistant UI.
|
||||
- The SSH addon on HA OS needs `sudo` for Docker socket access (`/run/docker.sock` is owned
|
||||
by `root:messagebus`).
|
||||
- The HA long-lived token typically does NOT have Supervisor API access (hassio endpoints
|
||||
return 401), so addon management must go through the Docker socket from the SSH addon.
|
||||
|
|
@ -1,128 +0,0 @@
|
|||
---
|
||||
name: nextcloud-calendar
|
||||
description: |
|
||||
Create, list, and query calendar events in Nextcloud via CalDAV. Use when:
|
||||
(1) User asks to create a calendar event, (2) User asks what's on their calendar,
|
||||
(3) User says "add to calendar" or "schedule", (4) User asks about upcoming events.
|
||||
Always use Nextcloud calendar unless user specifies otherwise.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2025-01-25
|
||||
---
|
||||
|
||||
# Nextcloud Calendar Management
|
||||
|
||||
## Problem
|
||||
Need to create, query, or manage calendar events in the user's Nextcloud calendar.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- User asks to create/add a calendar event
|
||||
- User asks "what's on my calendar?" or similar
|
||||
- User mentions scheduling something
|
||||
- User says "remind me" with a date (create calendar event)
|
||||
- Default calendar is always Nextcloud unless otherwise specified
|
||||
|
||||
## Prerequisites
|
||||
- Python 3 with `caldav` and `icalendar` packages available (installed via PYTHONPATH or system packages)
|
||||
- Environment variables `NEXTCLOUD_USER` and `NEXTCLOUD_APP_PASSWORD` must be set
|
||||
|
||||
## Solution
|
||||
|
||||
### Script Location
|
||||
```
|
||||
.claude/calendar-query.py
|
||||
```
|
||||
|
||||
### Execution Pattern (CRITICAL)
|
||||
Run the script directly with python3 (env vars are set in the environment):
|
||||
|
||||
```bash
|
||||
python3 .claude/calendar-query.py [command] [options]
|
||||
```
|
||||
|
||||
### Available Commands
|
||||
|
||||
#### List Calendars
|
||||
```bash
|
||||
python .claude/calendar-query.py list
|
||||
```
|
||||
|
||||
#### Query Events
|
||||
```bash
|
||||
# Today's events
|
||||
python .claude/calendar-query.py today
|
||||
|
||||
# Tomorrow's events
|
||||
python .claude/calendar-query.py tomorrow
|
||||
|
||||
# This week
|
||||
python .claude/calendar-query.py week
|
||||
|
||||
# This month
|
||||
python .claude/calendar-query.py month
|
||||
|
||||
# Custom date range
|
||||
python .claude/calendar-query.py events --days 14
|
||||
python .claude/calendar-query.py events --date 2026-04-10
|
||||
|
||||
# From specific calendar
|
||||
python .claude/calendar-query.py today --calendar "Work"
|
||||
```
|
||||
|
||||
#### Create Events
|
||||
```bash
|
||||
# All-day event (single day)
|
||||
python .claude/calendar-query.py create --title "Doctor appointment" --start "2026-03-15" --all-day
|
||||
|
||||
# All-day event (multi-day) - end date is EXCLUSIVE
|
||||
# For April 10-13, use end date April 14
|
||||
python .claude/calendar-query.py create --title "Vacation" --start "2026-04-10" --end "2026-04-14" --all-day
|
||||
|
||||
# Timed event
|
||||
python .claude/calendar-query.py create --title "Meeting" --start "2026-03-15 14:00" --end "2026-03-15 15:00"
|
||||
|
||||
# With location and description
|
||||
python .claude/calendar-query.py create --title "Lunch" --start "tomorrow 12:00" --location "Cafe" --description "Team lunch"
|
||||
|
||||
# Relative dates work
|
||||
python .claude/calendar-query.py create --title "Call" --start "today 16:00"
|
||||
python .claude/calendar-query.py create --title "Review" --start "tomorrow 10:00"
|
||||
```
|
||||
|
||||
### Output Formats
|
||||
```bash
|
||||
# JSON output (for parsing)
|
||||
python .claude/calendar-query.py today --json
|
||||
|
||||
# Text output (default, human-readable)
|
||||
python .claude/calendar-query.py week
|
||||
```
|
||||
|
||||
## Complete Example
|
||||
|
||||
To create an event "Team offsite" from March 20-22, 2026:
|
||||
|
||||
```bash
|
||||
python3 .claude/calendar-query.py create --title "Team offsite" --start "2026-03-20" --end "2026-03-23" --all-day
|
||||
```
|
||||
|
||||
## Important Notes
|
||||
|
||||
1. **End dates are exclusive** for all-day events (CalDAV standard). To create an event spanning April 10-13, set end to April 14.
|
||||
|
||||
2. **No delete/update commands** - The script currently only supports create and query. To modify events, user must do it manually in Nextcloud.
|
||||
|
||||
4. **Default calendar** is "Personal" - use `--calendar` flag for others.
|
||||
|
||||
## Verification
|
||||
- For queries: Output shows formatted event list
|
||||
- For creates: Output shows "Event created: [title]" with calendar name and start date
|
||||
- Exit code 0 = success, 1 = error (check output for details)
|
||||
|
||||
## Common Errors
|
||||
|
||||
| Error | Cause | Fix |
|
||||
|-------|-------|-----|
|
||||
| `NEXTCLOUD_USER and NEXTCLOUD_APP_PASSWORD must be set` | Env vars not set | Ensure `NEXTCLOUD_USER` and `NEXTCLOUD_APP_PASSWORD` are in the environment |
|
||||
| `Required packages not installed` | caldav/icalendar missing | Ensure PYTHONPATH includes the installed packages |
|
||||
| `Calendar 'X' not found` | Wrong calendar name | Run `list` command to see available calendars |
|
||||
|
|
@ -1,132 +0,0 @@
|
|||
---
|
||||
name: nfsv4-idmapd-uid-mapping
|
||||
description: |
|
||||
Fix for all file UIDs showing as 65534 (nobody) inside Kubernetes containers when using
|
||||
NFS volumes from TrueNAS/FreeBSD. Use when: (1) ls -lan inside a container shows all files
|
||||
owned by 65534:65534 despite correct ownership on the NFS server, (2) PostgreSQL fails with
|
||||
"data directory has wrong ownership", (3) chown inside containers returns "Invalid argument"
|
||||
on NFS volumes, (4) services that check file ownership (PostgreSQL, MySQL) crash on startup,
|
||||
(5) the same NFS mount shows correct UIDs on the host but 65534 inside containers,
|
||||
(6) NFSv4.2 appears in container mount output even though host mounts use NFSv3.
|
||||
Root cause: Kubernetes inline NFS volumes auto-negotiate NFSv4.2 (not NFSv3), and NFSv4
|
||||
idmapd fails to map UIDs when domains don't match or users don't exist on the server.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-03-01
|
||||
---
|
||||
|
||||
# NFSv4 idmapd UID Mapping — All Files Show as nobody (65534)
|
||||
|
||||
## Problem
|
||||
All files on NFS volumes appear owned by UID 65534 (nobody:nogroup) inside Kubernetes
|
||||
containers, even though `ls -lan` on the NFS server shows the correct UIDs (e.g., 999, 472).
|
||||
This breaks any service that checks file ownership: PostgreSQL refuses to start ("data
|
||||
directory has wrong ownership"), MySQL's entrypoint `chown` fails with "Invalid argument",
|
||||
and any `chown` inside the container returns EINVAL.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
|
||||
- TrueNAS CORE (FreeBSD) or TrueNAS SCALE as NFS server
|
||||
- NFSv4 enabled on the NFS server (`v4: true` in TrueNAS NFS config)
|
||||
- Kubernetes using inline NFS volumes (not PV/PVC with mount options)
|
||||
- **Key symptom**: `mount` inside the container shows `type nfs4 (vers=4.2,...)` even
|
||||
though existing kubelet mounts on the host show `vers=3`
|
||||
- **Key symptom**: Same NFS path mounted directly on the host shows correct UIDs, but
|
||||
inside any container shows 65534
|
||||
|
||||
## Root Cause
|
||||
|
||||
Kubernetes inline NFS volumes don't support `mountOptions`. When kubelet mounts NFS for a
|
||||
new pod, the Linux NFS client auto-negotiates the highest available version — NFSv4.2 if
|
||||
the server supports it.
|
||||
|
||||
NFSv4 uses **idmapd** for UID translation: the server translates UID→username (e.g.,
|
||||
`999→postgres@domain`), sends the username string over the wire, and the client translates
|
||||
it back to a local UID. This fails when:
|
||||
|
||||
1. **Domain mismatch**: Server domain (from hostname) differs from client domain
|
||||
- TrueNAS: `viktorbarzin.me` (from `truenas.viktorbarzin.me`)
|
||||
- K8s nodes: `viktorbarzin.lan` (from `k8s-node4.viktorbarzin.lan`)
|
||||
- When domains don't match, ALL UIDs fall back to `nobody` (65534)
|
||||
|
||||
2. **Unknown UIDs**: Even with matching domains, if the NFS server has no local user for
|
||||
UID 999 (common for container UIDs), idmapd maps it to `nobody`
|
||||
|
||||
**Why existing mounts work**: Older kubelet mounts (established before NFSv4 was enabled,
|
||||
or when the NFS client defaulted to v3) continue using NFSv3 with direct numeric UID
|
||||
passthrough. Only NEW mounts negotiate NFSv4.2.
|
||||
|
||||
## Solution
|
||||
|
||||
**Fix on TrueNAS (no NFS restart required):**
|
||||
|
||||
```bash
|
||||
# 1. Enable NFSv3-style numeric UID passthrough for NFSv4
|
||||
midclt call nfs.update '{"v4_v3owner": true, "v4_domain": "viktorbarzin.lan"}'
|
||||
|
||||
# 2. Restart nfsuserd with the correct domain (NOT nfsd — that would crash the cluster)
|
||||
killall nfsuserd
|
||||
nfsuserd -domain viktorbarzin.lan -force
|
||||
```
|
||||
|
||||
**Clear caches on all K8s nodes:**
|
||||
|
||||
```bash
|
||||
for node in k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
|
||||
ssh wizard@$node "sudo nfsidmap -c && sudo keyctl clear @u"
|
||||
done
|
||||
```
|
||||
|
||||
**Key settings explained:**
|
||||
- `v4_v3owner = true`: Makes NFSv4 use numeric UID passthrough like NFSv3, completely
|
||||
bypassing the username-based idmapd translation. **This is the critical fix.**
|
||||
- `v4_domain`: Should match the K8s nodes' DNS domain (check with `hostname -d` on a node)
|
||||
- `nfsuserd -domain <domain> -force`: FreeBSD daemon that handles NFSv4 user mapping.
|
||||
The `-force` flag is required if it thinks it's already running.
|
||||
|
||||
## Verification
|
||||
|
||||
```bash
|
||||
# Run a test pod and check UIDs
|
||||
kubectl run nfs-test --rm -it --restart=Never --image=alpine \
|
||||
--overrides='{"spec":{"containers":[{"name":"test","image":"alpine",
|
||||
"command":["sh","-c","ls -lan /data | head -5"],
|
||||
"volumeMounts":[{"name":"nfs","mountPath":"/data"}]}],
|
||||
"volumes":[{"name":"nfs","nfs":{"server":"10.0.10.15","path":"/mnt/main/some-path"}}]}}'
|
||||
|
||||
# Should show actual UIDs (e.g., 999, 472) instead of 65534
|
||||
```
|
||||
|
||||
## Debugging Steps
|
||||
|
||||
If you're not sure whether this is the issue:
|
||||
|
||||
```bash
|
||||
# 1. Check mount type INSIDE a container (not on the host!)
|
||||
kubectl exec <pod> -- mount | grep nfs
|
||||
# If it shows "type nfs4" with "vers=4.2" — this is the issue
|
||||
|
||||
# 2. Compare UIDs: host vs container
|
||||
# On host (via kubelet mount path):
|
||||
sudo ls -lan /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~nfs/<vol>/
|
||||
# Inside container:
|
||||
kubectl exec <pod> -- ls -lan /mount-path/
|
||||
|
||||
# 3. Check TrueNAS NFS config
|
||||
midclt call nfs.config # Look for v4: true, v4_v3owner, v4_domain
|
||||
|
||||
# 4. Check nfsuserd is running with the right domain
|
||||
ps aux | grep nfsuserd # On TrueNAS
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
- **NEVER restart NFS (nfsd)** on TrueNAS — it causes mount failures across ALL pods
|
||||
cluster-wide. Only restart `nfsuserd` (the ID mapping daemon).
|
||||
- Existing NFSv3 mounts continue working fine. The issue only affects NEW mounts.
|
||||
- The `v4_v3owner` setting is persistent across TrueNAS reboots (stored in middleware config).
|
||||
- The `nfsuserd` restart is NOT persistent — TrueNAS may restart it without the `-domain`
|
||||
flag after a reboot. The `v4_domain` setting in the middleware config should handle this,
|
||||
but verify after any TrueNAS restart.
|
||||
- On Linux NFS servers (not FreeBSD/TrueNAS), the equivalent fix is setting `Domain` in
|
||||
`/etc/idmapd.conf` on both server and all clients.
|
||||
|
|
@ -1,216 +0,0 @@
|
|||
---
|
||||
name: openclaw-k8s-deployment
|
||||
description: |
|
||||
Deploy and troubleshoot OpenClaw gateway on Kubernetes. Use when:
|
||||
(1) OpenClaw gateway won't start or shows "Telegram configured, not enabled yet",
|
||||
(2) exec fails with "requires a paired node (none available)",
|
||||
(3) gateway shows "Config invalid" for exec.host or exec.security values,
|
||||
(4) OpenClaw can't write files (EACCES on workspace or home),
|
||||
(5) gateway takes 5+ minutes to start (CPU throttling by VPA/LimitRange),
|
||||
(6) 502 Bad Gateway from Traefik after pod restart,
|
||||
(7) setting up Telegram bot channel,
|
||||
(8) configuring modelrelay sidecar for free model routing.
|
||||
Covers all non-obvious deployment gotchas discovered through trial and error.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-03-01
|
||||
---
|
||||
|
||||
# OpenClaw Kubernetes Deployment
|
||||
|
||||
## Problem
|
||||
Deploying OpenClaw as a Kubernetes pod involves many non-obvious configuration
|
||||
requirements. The gateway process, Telegram integration, exec permissions, and
|
||||
file ownership all have specific constraints not documented together.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Deploying OpenClaw from `ghcr.io/openclaw/openclaw` container image
|
||||
- Running in Kubernetes with NFS volumes, Traefik ingress, Goldilocks/VPA
|
||||
- Want Telegram bot integration, tool execution, and persistent state
|
||||
|
||||
## Solution
|
||||
|
||||
### 1. Gateway Configuration (openclaw.json)
|
||||
|
||||
**Required fields that aren't obvious:**
|
||||
|
||||
```json
|
||||
{
|
||||
"gateway": {
|
||||
"mode": "local",
|
||||
"bind": "lan",
|
||||
"controlUi": {
|
||||
"dangerouslyDisableDeviceAuth": true,
|
||||
"dangerouslyAllowHostHeaderOriginFallback": true
|
||||
}
|
||||
},
|
||||
"wizard": {
|
||||
"lastRunAt": "2026-03-01T00:00:00.000Z",
|
||||
"lastRunVersion": "2026.2.26",
|
||||
"lastRunCommand": "configure",
|
||||
"lastRunMode": "local"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
- `gateway.mode = "local"` — **required** or gateway refuses to start
|
||||
- `dangerouslyAllowHostHeaderOriginFallback = true` — required in v2026.2.26+
|
||||
for non-loopback Control UI (error: "non-loopback Control UI requires
|
||||
gateway.controlUi.allowedOrigins")
|
||||
- `wizard` block — **required** for Telegram to start. Without it, gateway logs
|
||||
"Telegram configured, not enabled yet" on every startup. The wizard block
|
||||
signals that initial setup was completed.
|
||||
|
||||
### 2. Exec Configuration
|
||||
|
||||
Valid values for `tools.exec`:
|
||||
|
||||
| Field | Valid Values | Notes |
|
||||
|-------|-------------|-------|
|
||||
| `host` | `sandbox`, `gateway`, `node` | NOT "local" — that's invalid |
|
||||
| `security` | `deny`, `allowlist`, `full` | NOT "off" — that's invalid |
|
||||
| `ask` | `"off"` | Disables confirmation prompts |
|
||||
|
||||
- `host = "gateway"` — runs commands on the container host directly
|
||||
- `host = "node"` — requires a "paired node" companion app (doesn't work in containers)
|
||||
- `host = "sandbox"` — requires Docker-in-Docker
|
||||
- `security = "full"` — most permissive valid option
|
||||
|
||||
### 3. Sandbox Mode
|
||||
|
||||
```json
|
||||
{
|
||||
"agents": {
|
||||
"defaults": {
|
||||
"sandbox": { "mode": "off" },
|
||||
"workspace": "/workspace/infra"
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
- `sandbox.mode = "off"` disables Docker sandboxing
|
||||
- `workspace` must be set explicitly — defaults to `~/.openclaw/workspace`
|
||||
|
||||
### 4. File Permissions
|
||||
|
||||
The init container runs as root but the main container runs as `node` (UID 1000).
|
||||
|
||||
**Must chown in init container:**
|
||||
```sh
|
||||
chown -R 1000:1000 /workspace/infra
|
||||
chown -R 1000:1000 /openclaw-home
|
||||
chmod 700 /openclaw-home
|
||||
```
|
||||
|
||||
**Must create directories:**
|
||||
```sh
|
||||
mkdir -p /openclaw-home/agents/main/sessions \
|
||||
/openclaw-home/credentials \
|
||||
/openclaw-home/canvas \
|
||||
/openclaw-home/devices \
|
||||
/openclaw-home/cron
|
||||
```
|
||||
|
||||
Without these: `EACCES: permission denied` errors for AGENTS.md, canvas,
|
||||
cron/jobs.json, devices, and other runtime files.
|
||||
|
||||
### 5. Startup Command
|
||||
|
||||
```sh
|
||||
node openclaw.mjs doctor --fix 2>/dev/null; exec node openclaw.mjs gateway --allow-unconfigured --bind lan
|
||||
```
|
||||
|
||||
Run `doctor --fix` before the gateway to auto-enable Telegram and fix
|
||||
config issues. Without this, Telegram stays "not enabled yet".
|
||||
|
||||
### 6. Resource Requirements
|
||||
|
||||
- **CPU limit: 2 cores minimum** — the Node.js gateway startup is CPU-intensive.
|
||||
With 150-300m CPU, startup takes 5+ minutes.
|
||||
- **Memory limit: 2Gi minimum** — the gateway OOM-kills at 1Gi during startup
|
||||
(V8 heap exhaustion).
|
||||
- **Goldilocks VPA will override these** — see "VPA Override" section below.
|
||||
|
||||
### 7. Readiness Probe
|
||||
|
||||
```hcl
|
||||
readiness_probe {
|
||||
tcp_socket { port = 18789 }
|
||||
initial_delay_seconds = 30
|
||||
period_seconds = 10
|
||||
}
|
||||
```
|
||||
|
||||
Do NOT use a startup probe — the gateway can take 2-3 minutes to start listening
|
||||
and a startup probe will kill it. Use readiness-only to prevent 502s from Traefik
|
||||
during startup without killing the container.
|
||||
|
||||
### 8. Telegram Integration
|
||||
|
||||
```json
|
||||
{
|
||||
"channels": {
|
||||
"telegram": {
|
||||
"enabled": true,
|
||||
"botToken": "...",
|
||||
"dmPolicy": "allowlist",
|
||||
"allowFrom": ["tg:USER_ID"],
|
||||
"groupPolicy": "allowlist",
|
||||
"streamMode": "partial"
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Telegram won't start without:
|
||||
1. The `wizard` block in config (signals setup was run)
|
||||
2. `doctor --fix` at startup (auto-enables the channel)
|
||||
3. Both `groupPolicy` and `streamMode` fields
|
||||
|
||||
### 9. NFS Volume Strategy
|
||||
|
||||
| Volume | Purpose | Type |
|
||||
|--------|---------|------|
|
||||
| `/home/node/.openclaw` | Persistent state (SOUL.md, sessions, memory, telegram) | NFS |
|
||||
| `/tools` | Cached binaries (kubectl, terraform, terragrunt, python libs) | NFS |
|
||||
| `/workspace` | Infra repo clone | NFS |
|
||||
| `/data` | General data | NFS |
|
||||
|
||||
Using NFS for tools cache reduces restart time from ~2.5min to ~38s by skipping
|
||||
binary downloads and pip installs on subsequent starts.
|
||||
|
||||
### 10. ModelRelay Sidecar
|
||||
|
||||
Deploy as a sidecar container for automatic free model routing:
|
||||
|
||||
```hcl
|
||||
container {
|
||||
name = "modelrelay"
|
||||
image = "node:22-alpine"
|
||||
command = ["sh", "-c", "npm install -g modelrelay; exec modelrelay --port 7352"]
|
||||
env { name = "NVIDIA_API_KEY"; value = "..." }
|
||||
env { name = "OPENROUTER_API_KEY"; value = "..." }
|
||||
}
|
||||
```
|
||||
|
||||
Configure as provider: `baseUrl = "http://127.0.0.1:7352/v1"`, model `auto-fastest`.
|
||||
|
||||
## Verification
|
||||
1. `kubectl logs -c openclaw` should show `[gateway] listening on ws://0.0.0.0:18789`
|
||||
2. No "Telegram configured, not enabled yet" message
|
||||
3. No `EACCES` permission errors
|
||||
4. `kubectl exec ... -- cat /proc/net/tcp` shows listening sockets
|
||||
5. Telegram bot responds to `/start`
|
||||
|
||||
## Notes
|
||||
- ConfigMap changes require pod restart (init container copies config at start)
|
||||
- ConfigMap taint+reinit sometimes needed when Terraform state gets out of sync
|
||||
- Goldilocks VPA recreates itself from namespace labels — must delete VPA on
|
||||
every pod recreation if namespace has `goldilocks.fairwinds.com/vpa-update-mode`
|
||||
- The `--allow-unconfigured` flag is needed for the gateway command
|
||||
- v2026.2.26 introduced breaking change requiring `dangerouslyAllowHostHeaderOriginFallback`
|
||||
|
||||
## See also
|
||||
- `openclaw-custom-model-provider` — basic model provider configuration
|
||||
- `k8s-limitrange-oom-silent-kill` — LimitRange causing OOM (related but different)
|
||||
|
|
@ -1,169 +0,0 @@
|
|||
---
|
||||
name: pfsense-dnsmasq-interface-binding
|
||||
description: |
|
||||
Restrict pfSense dnsmasq (DNS Forwarder) to specific interfaces to free port 53 on
|
||||
other interfaces for port forwarding. Use when: (1) pfSense blocks port 53 NAT port
|
||||
forward because dnsmasq is listening on *:53, (2) need to forward DNS from WAN to an
|
||||
internal DNS server while preserving client source IPs, (3) dnsmasq shows *:53 in
|
||||
sockstat despite --listen-address flags, (4) pfSense loses DNS resolution after
|
||||
restricting dnsmasq interfaces, (5) NAT rdr rules for port 53 silently fail to
|
||||
generate in /tmp/rules.debug.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-17
|
||||
---
|
||||
|
||||
# pfSense dnsmasq Interface Binding for DNS Port Forwarding
|
||||
|
||||
## Problem
|
||||
pfSense's dnsmasq (DNS Forwarder) binds to `*:53` by default. This prevents creating
|
||||
NAT port forward rules for port 53 — pfSense silently skips generating the pf `rdr`
|
||||
directive. You need to restrict dnsmasq to specific interfaces to free port 53 on other
|
||||
interfaces (e.g., WAN) for forwarding to an internal DNS server.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Attempting to create a NAT port forward for port 53 on the WAN interface
|
||||
- Port forward rule saves to config.xml but `pfctl -sn` shows no corresponding `rdr` rule
|
||||
- `sockstat -4 | grep ":53"` shows `dnsmasq` on `*:53`
|
||||
- Goal: Forward DNS queries from one network to an internal DNS server (e.g., Technitium)
|
||||
while preserving client source IPs (no masquerading)
|
||||
|
||||
## Solution
|
||||
|
||||
### Step 1: Bind dnsmasq to specific interfaces
|
||||
|
||||
Set the interface field in pfSense's dnsmasq config:
|
||||
|
||||
```php
|
||||
ssh admin@10.0.20.1 'php -r '"'"'
|
||||
require_once("config.inc");
|
||||
require_once("service-utils.inc");
|
||||
global $config;
|
||||
$config = parse_config(true);
|
||||
$config["dnsmasq"]["interface"] = "lan,opt1"; // Only LAN and OPT1, NOT wan
|
||||
write_config("Bind dnsmasq to LAN and OPT1 only");
|
||||
'"'"''
|
||||
```
|
||||
|
||||
This adds `--listen-address=<IP>` flags to dnsmasq but does NOT change socket binding.
|
||||
|
||||
### Step 2: Add bind-dynamic (CRITICAL)
|
||||
|
||||
Without `bind-dynamic`, dnsmasq still binds the socket to `*:53` even with
|
||||
`--listen-address` flags. The `--listen-address` only controls which queries get
|
||||
responses, not the actual socket binding.
|
||||
|
||||
```php
|
||||
ssh admin@10.0.20.1 'php -r '"'"'
|
||||
require_once("config.inc");
|
||||
require_once("service-utils.inc");
|
||||
global $config;
|
||||
$config = parse_config(true);
|
||||
$existing = base64_decode($config["dnsmasq"]["custom_options"]);
|
||||
if (strpos($existing, "bind-dynamic") === false) {
|
||||
$existing = "bind-dynamic\n" . $existing;
|
||||
$config["dnsmasq"]["custom_options"] = base64_encode($existing);
|
||||
write_config("Add bind-dynamic to restrict dnsmasq socket binding");
|
||||
}
|
||||
'"'"''
|
||||
```
|
||||
|
||||
### Step 3: Add localhost listen address (CRITICAL)
|
||||
|
||||
pfSense's own `resolv.conf` points to `127.0.0.1`. Without this, pfSense itself
|
||||
loses DNS resolution after the interface restriction.
|
||||
|
||||
```php
|
||||
# Add to custom_options (base64-encoded in config):
|
||||
listen-address=127.0.0.1
|
||||
```
|
||||
|
||||
### Step 4: Restart dnsmasq
|
||||
|
||||
```php
|
||||
services_dnsmasq_configure();
|
||||
```
|
||||
|
||||
### Step 5: Verify binding
|
||||
|
||||
```bash
|
||||
sockstat -4 | grep ":53 "
|
||||
# Should show specific IPs, not *:53:
|
||||
# 127.0.0.1:53
|
||||
# 10.0.10.1:53 (lan)
|
||||
# 10.0.20.1:53 (opt1)
|
||||
# NOT 192.168.1.2:53 (wan)
|
||||
```
|
||||
|
||||
### Step 6: Add the port forward rule
|
||||
|
||||
**Critical format note**: The `source` field must use `array("any" => "")`, NOT
|
||||
`array("network" => "192.168.1.0/24")`. The CIDR source format silently fails to
|
||||
generate the pf `rdr` directive.
|
||||
|
||||
```php
|
||||
ssh admin@10.0.20.1 'php -r '"'"'
|
||||
require_once("config.inc");
|
||||
require_once("filter.inc");
|
||||
require_once("shaper.inc");
|
||||
global $config;
|
||||
$config = parse_config(true);
|
||||
|
||||
$rule = array(
|
||||
"source" => array("any" => ""), // MUST be "any", not CIDR
|
||||
"destination" => array(
|
||||
"network" => "wanip",
|
||||
"port" => "53"
|
||||
),
|
||||
"ipprotocol" => "inet",
|
||||
"protocol" => "udp",
|
||||
"target" => "10.0.20.204", // Internal DNS server
|
||||
"local-port" => "53",
|
||||
"interface" => "wan",
|
||||
"associated-rule-id" => "pass",
|
||||
"descr" => "DNS to internal DNS (preserve client IP)",
|
||||
"created" => array("time" => (string)time(), "username" => "admin"),
|
||||
"updated" => array("time" => (string)time(), "username" => "admin")
|
||||
);
|
||||
array_unshift($config["nat"]["rule"], $rule);
|
||||
write_config("Add DNS port forward");
|
||||
filter_configure();
|
||||
'"'"''
|
||||
```
|
||||
|
||||
### Step 7: Verify the redirect rule
|
||||
|
||||
```bash
|
||||
pfctl -sn | grep "domain\|:53"
|
||||
# Should show: rdr pass on vtnet0 inet proto udp from any to 192.168.1.2 port = domain -> 10.0.20.204
|
||||
```
|
||||
|
||||
## Verification
|
||||
|
||||
1. pfSense own DNS: `nslookup google.com 127.0.0.1` (from pfSense shell)
|
||||
2. Internal DNS: `nslookup google.com 10.0.20.1` (from LAN/OPT1 clients)
|
||||
3. Port forward: `dig @192.168.1.2 example.com` (from WAN-side client)
|
||||
4. Client IP: Check DNS server logs — should show real client IP, not pfSense IP
|
||||
|
||||
## Pitfalls
|
||||
|
||||
| Pitfall | Symptom | Fix |
|
||||
|---------|---------|-----|
|
||||
| Missing `bind-dynamic` | sockstat shows `*:53`, port forward still blocked | Add `bind-dynamic` to custom_options |
|
||||
| Missing `listen-address=127.0.0.1` | pfSense loses all DNS resolution | Add to custom_options |
|
||||
| Source `"network" => "CIDR"` in NAT rule | Rule saves to config but no `rdr` in `pfctl -sn` | Use `"any" => ""` instead |
|
||||
| Using local `$config` variable | Config not persisted after PHP exit | Always use `global $config` |
|
||||
| Not calling `filter_configure()` | Rule in config.xml but not in pf | Call after `write_config()` |
|
||||
| Custom options not base64 | dnsmasq fails to start | pfSense stores custom_options as base64 |
|
||||
|
||||
## Notes
|
||||
- `bind-dynamic` is preferred over `bind-interfaces` because it handles interfaces that
|
||||
come up after dnsmasq starts (e.g., VPN tunnels)
|
||||
- The pf `rdr` rule is a redirect, not masquerade — source IP is preserved
|
||||
- dnsmasq custom_options in pfSense config.xml are base64-encoded
|
||||
- Check `/tmp/rules.debug` for the generated pf ruleset (before loading into pf)
|
||||
- Use `pfctl -sn` to see rules actually loaded in the running firewall
|
||||
|
||||
## See also
|
||||
- `pfsense` — General pfSense management skill
|
||||
- `k8s-ndots-search-domain-nxdomain-flood` — Related DNS optimization
|
||||
|
|
@ -1,105 +0,0 @@
|
|||
---
|
||||
name: pfsense-nat-rule-creation
|
||||
description: |
|
||||
Create NAT port forward rules on pfSense programmatically via PHP/SSH.
|
||||
Use when: (1) adding port forwards for new K8s services, (2) NAT rules
|
||||
added via PHP don't appear in pfctl output, (3) config_read_array() throws
|
||||
"undefined function" error, (4) destination "wanip" not working in NAT rules,
|
||||
(5) rules saved to config.xml but not loaded into pfctl. Covers the correct
|
||||
PHP array structure, config API differences between pfSense versions, and
|
||||
the required pfctl reload step.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-21
|
||||
---
|
||||
|
||||
# pfSense NAT Rule Creation via PHP
|
||||
|
||||
## Problem
|
||||
Creating NAT port forward rules on pfSense programmatically via SSH/PHP has
|
||||
multiple gotchas around the config API, rule structure, and rule loading.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Adding a port forward for a new Kubernetes service (e.g., TURN, game server)
|
||||
- Using `ssh admin@10.0.20.1` + PHP to automate pfSense config
|
||||
- NAT rules don't appear in `pfctl -sn` after `write_config()` + `filter_configure()`
|
||||
- `config_read_array()` throws "Call to undefined function"
|
||||
- Rules saved to config.xml but pfctl doesn't have them
|
||||
|
||||
## Solution
|
||||
|
||||
### Correct PHP for adding NAT rules
|
||||
|
||||
```php
|
||||
<?php
|
||||
require_once("config.inc");
|
||||
require_once("filter.inc");
|
||||
global $config; // NOT config_read_array() — that doesn't exist in pfSense 2.7.x
|
||||
|
||||
$config["nat"]["rule"][] = array(
|
||||
"interface" => "wan",
|
||||
"ipprotocol" => "inet", // Required! Must be "inet" for IPv4
|
||||
"protocol" => "tcp/udp", // Or "udp" or "tcp"
|
||||
"source" => array("any" => ""),
|
||||
"destination" => array(
|
||||
"network" => "wanip", // Use "network" => "wanip", NOT "address" => "wanip"
|
||||
"port" => "3478" // Single port or "start:end" for range
|
||||
),
|
||||
"target" => "10.0.20.200", // Internal destination IP
|
||||
"local-port" => "3478", // Internal port (for ranges, just the start port)
|
||||
"descr" => "My port forward",
|
||||
"associated-rule-id" => "pass" // Auto-create firewall pass rule
|
||||
);
|
||||
|
||||
write_config("Description for config history");
|
||||
filter_configure();
|
||||
```
|
||||
|
||||
### Key gotchas
|
||||
|
||||
1. **`config_read_array()` doesn't exist** in pfSense 2.7.x. Use `global $config` instead.
|
||||
|
||||
2. **Destination format**: Use `"network" => "wanip"`, NOT `"address" => "wanip"` or `"address" => "192.168.1.2"`. The `"network"` key with `"wanip"` tells pfSense to resolve the WAN IP dynamically.
|
||||
|
||||
3. **`ipprotocol` is required**: Must include `"ipprotocol" => "inet"` or rules won't generate in `/tmp/rules.debug`.
|
||||
|
||||
4. **Port ranges**: Use `"port" => "49152:49252"` for ranges. The `"local-port"` should be just the start port — pfSense maps the range automatically.
|
||||
|
||||
5. **Rules may not load immediately**: After `write_config()` + `filter_configure()`, rules appear in `/tmp/rules.debug` but may not be in pfctl until the next filter reload. Force with:
|
||||
```bash
|
||||
pfctl -f /tmp/rules.debug
|
||||
```
|
||||
|
||||
6. **SSH quoting**: The pfsense.py `php` command breaks on `\n` in strings. For multi-line PHP, write a `.php` file, `scp` it, and execute:
|
||||
```bash
|
||||
scp script.php admin@10.0.20.1:/tmp/
|
||||
ssh admin@10.0.20.1 "php /tmp/script.php"
|
||||
```
|
||||
|
||||
### Execution via pfsense.py
|
||||
|
||||
For simple single-line PHP (no newlines or backslashes):
|
||||
```bash
|
||||
python3 .claude/pfsense.py php 'require_once("config.inc"); ...; echo "Done";'
|
||||
```
|
||||
|
||||
For complex scripts, use scp + ssh as above.
|
||||
|
||||
## Verification
|
||||
|
||||
```bash
|
||||
# Check rules in config
|
||||
ssh admin@10.0.20.1 "grep 'YOUR_PORT' /cf/conf/config.xml"
|
||||
|
||||
# Check generated pf rules
|
||||
ssh admin@10.0.20.1 "grep 'YOUR_PORT' /tmp/rules.debug"
|
||||
|
||||
# Check active pfctl rules
|
||||
python3 .claude/pfsense.py pfctl "-sn" | grep YOUR_PORT
|
||||
```
|
||||
|
||||
## Notes
|
||||
- Existing working NAT rules on this pfSense use the same structure (check WireGuard port 51820 as reference)
|
||||
- The `associated-rule-id: pass` auto-creates a WAN firewall rule to allow the forwarded traffic
|
||||
- pfSense applies NAT rules across ALL interfaces when using the web UI, but PHP-created rules only apply to the specified interface
|
||||
- See also: `pfsense` skill for general pfSense management
|
||||
|
|
@ -1,136 +0,0 @@
|
|||
---
|
||||
name: proxmox-vm-disk-expansion-pitfalls
|
||||
description: |
|
||||
Troubleshoot common failures when expanding Proxmox VM disks on Ubuntu 24.04
|
||||
cloud-init images and draining Kubernetes nodes. Use when: (1) growpart fails
|
||||
with "command not found" on Ubuntu cloud-init VMs, (2) grep -P fails on macOS
|
||||
with "invalid option -- P", (3) kubectl drain times out with pods stuck
|
||||
terminating, (4) filesystem shows old size after qm resize. Covers
|
||||
cloud-guest-utils installation, macOS-portable regex parsing, drain timeout
|
||||
tuning, and recovery from partial failures.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-13
|
||||
---
|
||||
|
||||
# Proxmox VM Disk Expansion Pitfalls
|
||||
|
||||
## Problem
|
||||
|
||||
Expanding disk storage on Proxmox-hosted Ubuntu 24.04 cloud-init VMs (used as
|
||||
Kubernetes nodes) fails at multiple points due to missing tools, cross-platform
|
||||
incompatibilities, and Kubernetes drain timeouts.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
|
||||
- Running disk expansion scripts from macOS against Proxmox + Ubuntu VMs
|
||||
- Ubuntu 24.04 cloud-init images (the default k8s node template)
|
||||
- Kubernetes nodes with many pods or stateful workloads
|
||||
- Using `scripts/extend_vm_storage.sh` or similar automation
|
||||
|
||||
## Issues and Solutions
|
||||
|
||||
### 1. `growpart: command not found` on Ubuntu 24.04
|
||||
|
||||
**Symptom**: After `qm resize`, SSH into VM, run `growpart /dev/sda 1` — fails
|
||||
with "command not found". `resize2fs` then reports "Nothing to do!" because the
|
||||
partition table hasn't been updated.
|
||||
|
||||
**Root cause**: Ubuntu 24.04 cloud-init images don't include `cloud-guest-utils`
|
||||
by default. The `growpart` tool (which updates the partition table to use new
|
||||
disk space) is in this package.
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
sudo apt-get update -qq && sudo apt-get install -y -qq cloud-guest-utils
|
||||
sudo growpart /dev/sda 1
|
||||
sudo resize2fs /dev/sda1
|
||||
```
|
||||
|
||||
**Prevention**: Check for `growpart` before attempting partition expansion:
|
||||
```bash
|
||||
if ! command -v growpart &>/dev/null; then
|
||||
sudo apt-get update -qq && sudo apt-get install -y -qq cloud-guest-utils
|
||||
fi
|
||||
```
|
||||
|
||||
### 2. `grep -P` (PCRE) not available on macOS
|
||||
|
||||
**Symptom**: Script running on macOS fails with `grep: invalid option -- P`.
|
||||
|
||||
**Root cause**: macOS ships BSD grep, which doesn't support `-P` (Perl-compatible
|
||||
regex). GNU grep (from Homebrew) does, but scripts shouldn't assume it's installed.
|
||||
|
||||
**Fix**: Replace `grep -oP 'pattern\Kcapture'` with portable `sed`:
|
||||
```bash
|
||||
# BAD (GNU grep only):
|
||||
CURRENT_SIZE=$(echo "$LINE" | grep -oP 'size=\K[0-9]+G')
|
||||
|
||||
# GOOD (portable):
|
||||
CURRENT_SIZE=$(echo "$LINE" | sed -n 's/.*size=\([0-9]*G\).*/\1/p')
|
||||
```
|
||||
|
||||
**General rule**: In scripts that run on macOS, avoid `grep -P`, `sed -i ''`
|
||||
vs `sed -i` differences, and `date` flag differences. Use `sed` with basic
|
||||
regex or bash built-in `[[ =~ ]]` for pattern matching.
|
||||
|
||||
### 3. `kubectl drain` timeout with stuck pods
|
||||
|
||||
**Symptom**: `kubectl drain --timeout=120s` fails with "context deadline exceeded"
|
||||
for multiple pods. Pods are evicted but don't terminate in time.
|
||||
|
||||
**Root cause**: Some pods (stateful services like ClickHouse, Paperless-ngx,
|
||||
OnlyOffice) need more time to shut down gracefully. 120s isn't enough when many
|
||||
pods are draining simultaneously.
|
||||
|
||||
**Fix**: Use `--force` flag and a longer timeout, or retry:
|
||||
```bash
|
||||
# First attempt with standard timeout
|
||||
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --timeout=120s
|
||||
|
||||
# If it fails, force with longer timeout (pods already evicting)
|
||||
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --timeout=300s --force
|
||||
```
|
||||
|
||||
**Note**: After a failed drain, the node is already cordoned. A second drain
|
||||
attempt only needs to wait for already-evicting pods to finish.
|
||||
|
||||
### 4. Recovery from partial failure
|
||||
|
||||
If the script fails mid-way (after drain but before uncordon):
|
||||
|
||||
```bash
|
||||
# Check VM status
|
||||
ssh root@192.168.1.127 "qm status <vmid>"
|
||||
|
||||
# Start VM if stopped
|
||||
ssh root@192.168.1.127 "qm start <vmid>"
|
||||
|
||||
# Uncordon node
|
||||
kubectl --kubeconfig $(pwd)/config uncordon <node-name>
|
||||
```
|
||||
|
||||
## Verification
|
||||
|
||||
After successful expansion:
|
||||
```bash
|
||||
# On the VM
|
||||
df -h /
|
||||
# Should show new size (128G disk → ~126G usable for ext4)
|
||||
|
||||
# On the cluster
|
||||
kubectl get node <name>
|
||||
# Should show Ready status
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
- The k8s node VMs use direct partition layout (`/dev/sda1`), not LVM, despite
|
||||
the script handling both paths
|
||||
- `growpart` returns exit code 1 for "NOCHANGE" (partition already at max) —
|
||||
this is not an error
|
||||
- Proxmox `qm resize` uses `scsi0` as the disk identifier for these VMs
|
||||
- SSH host keys may change if VMs are recreated or network changes — use
|
||||
`-o StrictHostKeyChecking=no` in automated scripts
|
||||
|
||||
See also: `extend-vm-storage.md` (the operational skill for running the script)
|
||||
|
|
@ -1,182 +0,0 @@
|
|||
---
|
||||
name: python-filename-sanitization
|
||||
description: |
|
||||
Secure filename sanitization pattern for Python web applications. Use when:
|
||||
(1) Accepting user-provided filenames for file operations, (2) Building file
|
||||
rename/upload functionality, (3) Preventing path traversal attacks (../../../etc/passwd),
|
||||
(4) Preventing shell injection through filenames, (5) FastAPI/Flask file handling.
|
||||
Provides regex-based whitelist approach with pathlib for safe file operations.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2025-01-31
|
||||
---
|
||||
|
||||
# Python Filename Sanitization
|
||||
|
||||
## Problem
|
||||
User-provided filenames can contain malicious characters that enable path traversal
|
||||
attacks, shell injection, or filesystem corruption. Direct use of user input in
|
||||
file paths is a security vulnerability.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Building file upload, rename, or download functionality
|
||||
- User can specify filenames via API or form input
|
||||
- Files are stored on server filesystem
|
||||
- Need to prevent: `../`, shell metacharacters, null bytes, etc.
|
||||
|
||||
## Solution
|
||||
|
||||
### Complete Sanitization Function
|
||||
```python
|
||||
import re
|
||||
from pathlib import Path
|
||||
|
||||
def sanitize_filename(filename: str, max_length: int = 200) -> str:
|
||||
"""
|
||||
Sanitize a filename to prevent path traversal and shell injection.
|
||||
Only allows alphanumeric characters, spaces, hyphens, underscores,
|
||||
parentheses, and dots.
|
||||
"""
|
||||
if not filename:
|
||||
raise ValueError("Filename cannot be empty")
|
||||
|
||||
# Remove any path components (prevent path traversal)
|
||||
filename = Path(filename).name
|
||||
|
||||
# Only allow safe characters: alphanumeric, space, hyphen, underscore, parentheses, dot
|
||||
# This regex removes anything that isn't in the allowed set
|
||||
safe_filename = re.sub(r'[^a-zA-Z0-9\s\-_().]', '', filename)
|
||||
|
||||
# Collapse multiple spaces/dots
|
||||
safe_filename = re.sub(r'\s+', ' ', safe_filename)
|
||||
safe_filename = re.sub(r'\.+', '.', safe_filename)
|
||||
|
||||
# Strip leading/trailing whitespace and dots
|
||||
safe_filename = safe_filename.strip(' .')
|
||||
|
||||
# Limit length
|
||||
if len(safe_filename) > max_length:
|
||||
safe_filename = safe_filename[:max_length]
|
||||
|
||||
if not safe_filename:
|
||||
raise ValueError("Filename contains no valid characters")
|
||||
|
||||
return safe_filename
|
||||
```
|
||||
|
||||
### FastAPI Integration Example
|
||||
```python
|
||||
from fastapi import APIRouter, HTTPException
|
||||
from pydantic import BaseModel
|
||||
from pathlib import Path
|
||||
|
||||
class RenameRequest(BaseModel):
|
||||
new_name: str
|
||||
|
||||
@router.patch("/files/{file_id}/rename")
|
||||
async def rename_file(file_id: str, request: RenameRequest):
|
||||
"""Rename a file with sanitized input."""
|
||||
file_dir = Path("/data/files") / file_id
|
||||
|
||||
if not file_dir.exists():
|
||||
raise HTTPException(status_code=404, detail="File not found")
|
||||
|
||||
# Find existing file
|
||||
files = list(file_dir.glob("*"))
|
||||
if not files:
|
||||
raise HTTPException(status_code=404, detail="No file found")
|
||||
|
||||
current_file = files[0]
|
||||
current_extension = current_file.suffix
|
||||
|
||||
# Sanitize the new name
|
||||
try:
|
||||
safe_name = sanitize_filename(request.new_name)
|
||||
except ValueError as e:
|
||||
raise HTTPException(status_code=400, detail=str(e))
|
||||
|
||||
# Preserve original extension
|
||||
if not safe_name.lower().endswith(current_extension.lower()):
|
||||
safe_name = safe_name + current_extension
|
||||
|
||||
# Create new path (same directory, new filename)
|
||||
new_file = file_dir / safe_name
|
||||
|
||||
# Check for conflicts
|
||||
if new_file.exists() and new_file != current_file:
|
||||
raise HTTPException(status_code=400, detail="A file with that name already exists")
|
||||
|
||||
# Rename using pathlib (no shell commands!)
|
||||
current_file.rename(new_file)
|
||||
|
||||
return {"status": "renamed", "new_filename": safe_name}
|
||||
```
|
||||
|
||||
## Key Security Principles
|
||||
|
||||
### 1. Whitelist, Don't Blacklist
|
||||
```python
|
||||
# BAD: Trying to block dangerous characters
|
||||
filename = filename.replace('../', '').replace('\x00', '')
|
||||
|
||||
# GOOD: Only allow known-safe characters
|
||||
safe_filename = re.sub(r'[^a-zA-Z0-9\s\-_().]', '', filename)
|
||||
```
|
||||
|
||||
### 2. Use pathlib, Not Shell Commands
|
||||
```python
|
||||
# BAD: Shell command (vulnerable to injection)
|
||||
os.system(f'mv "{old_path}" "{new_path}"')
|
||||
|
||||
# GOOD: Pure Python (no shell)
|
||||
old_path.rename(new_path)
|
||||
```
|
||||
|
||||
### 3. Extract Basename First
|
||||
```python
|
||||
# BAD: User could submit "../../../etc/passwd"
|
||||
filename = user_input
|
||||
|
||||
# GOOD: Extract just the filename part
|
||||
filename = Path(user_input).name
|
||||
```
|
||||
|
||||
### 4. Validate After Sanitization
|
||||
```python
|
||||
# Ensure something remains after sanitization
|
||||
if not safe_filename:
|
||||
raise ValueError("Filename contains no valid characters")
|
||||
```
|
||||
|
||||
## Verification
|
||||
```python
|
||||
# Test cases that should be handled safely
|
||||
assert sanitize_filename("normal.txt") == "normal.txt"
|
||||
assert sanitize_filename("../../../etc/passwd") == "etcpasswd"
|
||||
assert sanitize_filename("file; rm -rf /") == "file rm -rf"
|
||||
assert sanitize_filename(" spaces .txt") == "spaces.txt"
|
||||
assert sanitize_filename("$(whoami).txt") == "whoami.txt"
|
||||
|
||||
# Test cases that should raise errors
|
||||
try:
|
||||
sanitize_filename("") # Should raise ValueError
|
||||
except ValueError:
|
||||
pass
|
||||
|
||||
try:
|
||||
sanitize_filename("$#@!") # Should raise ValueError (no valid chars)
|
||||
except ValueError:
|
||||
pass
|
||||
```
|
||||
|
||||
## Notes
|
||||
- This is intentionally restrictive; expand the regex if you need Unicode support
|
||||
- For Unicode filenames, consider `unicodedata.normalize('NFKD', ...)` first
|
||||
- Max length of 200 is conservative; filesystem limits vary (255 bytes typical)
|
||||
- Always preserve file extensions when renaming to avoid breaking file associations
|
||||
- Consider adding a UUID prefix for guaranteed uniqueness in upload scenarios
|
||||
|
||||
## References
|
||||
- [OWASP Path Traversal](https://owasp.org/www-community/attacks/Path_Traversal)
|
||||
- [CWE-22: Path Traversal](https://cwe.mitre.org/data/definitions/22.html)
|
||||
- [Python pathlib documentation](https://docs.python.org/3/library/pathlib.html)
|
||||
|
|
@ -1,116 +0,0 @@
|
|||
---
|
||||
name: sops-age-secrets-migration
|
||||
description: |
|
||||
Migrate from git-crypt to SOPS + age for multi-user secret management in a
|
||||
Terraform/Terragrunt infrastructure repo. Use when: (1) need per-user secret
|
||||
access control (git-crypt is all-or-nothing), (2) want operators to push PRs
|
||||
without seeing secrets (CI decrypts), (3) migrating from a single encrypted
|
||||
terraform.tfvars to structured secret management. Covers: JSON format (not YAML
|
||||
— Terraform can't parse YAML tfvars), race condition avoidance with parallel
|
||||
terragrunt applies, CI pipeline integration with Woodpecker, age key management,
|
||||
and the complete migration sequence.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-03-07
|
||||
---
|
||||
|
||||
# SOPS + age Secrets Migration from git-crypt
|
||||
|
||||
## Problem
|
||||
git-crypt encrypts entire files — anyone with the key decrypts everything. For multi-user
|
||||
setups where operators should push code without seeing secrets, you need per-value encryption
|
||||
with CI-only decryption.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Single `terraform.tfvars` encrypted with git-crypt containing 100+ secrets
|
||||
- Need to onboard operators who shouldn't see API keys, passwords, SSH keys
|
||||
- Want GitOps (secrets in git) but with access control
|
||||
- Terraform/Terragrunt stack-per-service architecture
|
||||
|
||||
## Solution
|
||||
|
||||
### 1. Use JSON, not YAML
|
||||
SOPS outputs the same format as input. `sops -d file.yaml` → YAML. `sops -d file.json` → JSON.
|
||||
Terraform natively supports `*.auto.tfvars.json` files. YAML is NOT valid HCL.
|
||||
|
||||
```
|
||||
secrets.sops.json → sops -d → secrets.auto.tfvars.json → Terraform reads it
|
||||
```
|
||||
|
||||
### 2. Split tfvars into config + secrets
|
||||
```
|
||||
config.tfvars ← plaintext (hostnames, IPs, DNS records)
|
||||
secrets.sops.json ← SOPS-encrypted (passwords, tokens, keys)
|
||||
```
|
||||
|
||||
### 3. Global decrypt, not per-stack hooks
|
||||
**CRITICAL**: Do NOT use `before_hook`/`after_hook` for decryption. With `terragrunt run --all`,
|
||||
70+ stacks run hooks in parallel, all writing to the same output file — race condition.
|
||||
|
||||
Instead, use a wrapper script that decrypts once:
|
||||
```bash
|
||||
#!/usr/bin/env bash
|
||||
# scripts/tg — decrypt then terragrunt
|
||||
REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
|
||||
if [ ! -f "$REPO_ROOT/secrets.auto.tfvars.json" ] || \
|
||||
[ "$REPO_ROOT/secrets.sops.json" -nt "$REPO_ROOT/secrets.auto.tfvars.json" ]; then
|
||||
sops -d "$REPO_ROOT/secrets.sops.json" > "$REPO_ROOT/secrets.auto.tfvars.json"
|
||||
fi
|
||||
exec terragrunt "$@"
|
||||
```
|
||||
|
||||
### 4. Terragrunt loads both (backward compatible)
|
||||
```hcl
|
||||
terraform {
|
||||
extra_arguments "common_vars" {
|
||||
commands = get_terraform_commands_that_need_vars()
|
||||
required_var_files = ["${get_repo_root()}/config.tfvars"]
|
||||
optional_var_files = [
|
||||
"${get_repo_root()}/terraform.tfvars", # legacy (git-crypt)
|
||||
"${get_repo_root()}/secrets.auto.tfvars.json" # new (SOPS)
|
||||
]
|
||||
}
|
||||
before_hook "check_secrets" {
|
||||
commands = ["apply", "plan", "destroy"]
|
||||
execute = ["test", "-f", "${get_repo_root()}/secrets.auto.tfvars.json"]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### 5. Complex types work in JSON
|
||||
Maps, lists, nested objects, multiline strings (SSH keys as `\n`-escaped) all work:
|
||||
```json
|
||||
{
|
||||
"simple_password": "abc123",
|
||||
"mailserver_accounts": {"user@domain": "pass"},
|
||||
"ssh_key": "-----BEGIN OPENSSH PRIVATE KEY-----\nb3Blbn...\n-----END OPENSSH PRIVATE KEY-----\n"
|
||||
}
|
||||
```
|
||||
|
||||
### 6. CI integration (Woodpecker)
|
||||
- Store age private key as CI secret (`SOPS_AGE_KEY`)
|
||||
- Write to temp file for `SOPS_AGE_KEY_FILE` (Woodpecker `from_secret` only does env vars)
|
||||
- `git add stacks/ state/ .woodpecker/` — NEVER `git add .`
|
||||
- Cleanup step with `status: [success, failure]`
|
||||
|
||||
## Verification
|
||||
```bash
|
||||
# Encrypt
|
||||
sops -e -i secrets.sops.json
|
||||
|
||||
# Decrypt and verify
|
||||
sops -d secrets.sops.json | jq .
|
||||
|
||||
# Verify SSH keys
|
||||
sops -d secrets.sops.json | jq -r '.ssh_key' | ssh-keygen -l -f -
|
||||
|
||||
# Test with terragrunt
|
||||
scripts/tg validate
|
||||
```
|
||||
|
||||
## Notes
|
||||
- Keep git-crypt for binary files (TLS certs, deploy keys) — SOPS can't encrypt binary
|
||||
- `sensitive = true` on all secret variable declarations — prevents plan output leaks
|
||||
- Don't add `sensitive = true` to non-secret variables with "secret" in the name (e.g., `tls_secret_name`, `ingress_path`) — breaks `for_each` on lists
|
||||
- Age keys are one line — much simpler than GPG
|
||||
- `.sops.yaml` path_regex should be anchored: `^secrets\.sops\.json$`
|
||||
|
|
@ -1,97 +0,0 @@
|
|||
---
|
||||
name: terraform-state-identity-mismatch
|
||||
description: |
|
||||
Fix Terraform "Unexpected Identity Change" errors during plan/apply. Use when:
|
||||
(1) Terraform fails with "the Terraform Provider unexpectedly returned a different
|
||||
identity", (2) State refresh shows identity mismatch between stored and current values,
|
||||
(3) Resource was created but terraform apply timed out, leaving state inconsistent.
|
||||
Solution involves removing and reimporting the affected resource.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-01-28
|
||||
---
|
||||
|
||||
# Terraform State Identity Mismatch Fix
|
||||
|
||||
## Problem
|
||||
Terraform fails during plan or apply with an "Unexpected Identity Change" error,
|
||||
indicating the stored state identity doesn't match what the provider returns when
|
||||
reading the resource.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Error message contains: "Unexpected Identity Change: During the read operation,
|
||||
the Terraform Provider unexpectedly returned a different identity"
|
||||
- Often occurs after a terraform apply times out mid-creation
|
||||
- Resource exists in the cluster/cloud but state is corrupted
|
||||
- Common with Kubernetes provider after deployment rollout timeouts
|
||||
|
||||
## Solution
|
||||
|
||||
### Step 1: Identify the affected resource
|
||||
The error message includes the resource address:
|
||||
```
|
||||
with module.kubernetes_cluster.module.resume["resume"].kubernetes_deployment.resume
|
||||
```
|
||||
|
||||
### Step 2: Remove from state
|
||||
```bash
|
||||
terraform state rm 'module.kubernetes_cluster.module.resume["resume"].kubernetes_deployment.resume'
|
||||
```
|
||||
Note: Use single quotes around the address to handle brackets properly.
|
||||
|
||||
### Step 3: Import the resource back
|
||||
```bash
|
||||
terraform import 'module.kubernetes_cluster.module.resume["resume"].kubernetes_deployment.resume' <namespace>/<name>
|
||||
```
|
||||
For Kubernetes deployments, the import ID is `namespace/deployment-name`.
|
||||
|
||||
### Step 4: Verify with plan
|
||||
```bash
|
||||
terraform plan -target=<module-path>
|
||||
```
|
||||
Should show minimal or no changes if import was successful.
|
||||
|
||||
### Step 5: Apply to sync any drift
|
||||
```bash
|
||||
terraform apply -target=<module-path>
|
||||
```
|
||||
|
||||
## Verification
|
||||
- `terraform plan` runs without identity errors
|
||||
- `terraform apply` completes successfully
|
||||
- Resource still exists and functions correctly
|
||||
|
||||
## Example
|
||||
**Error:**
|
||||
```
|
||||
Error: Unexpected Identity Change
|
||||
|
||||
Current Identity: cty.ObjectVal(map[string]cty.Value{"api_version":cty.NullVal...})
|
||||
New Identity: cty.ObjectVal(map[string]cty.Value{"api_version":cty.StringVal("apps/v1")...})
|
||||
|
||||
with module.kubernetes_cluster.module.resume["resume"].kubernetes_deployment.resume
|
||||
```
|
||||
|
||||
**Fix:**
|
||||
```bash
|
||||
terraform state rm 'module.kubernetes_cluster.module.resume["resume"].kubernetes_deployment.resume'
|
||||
# Output: Removed ... Successfully removed 1 resource instance(s).
|
||||
|
||||
terraform import 'module.kubernetes_cluster.module.resume["resume"].kubernetes_deployment.resume' resume/resume
|
||||
# Output: Import successful!
|
||||
|
||||
terraform apply -target=module.kubernetes_cluster.module.resume -auto-approve
|
||||
# Output: Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
|
||||
```
|
||||
|
||||
## Notes
|
||||
- This is a provider bug, not user error - consider reporting to provider maintainers
|
||||
- The resource continues to work fine; only the terraform state is affected
|
||||
- Always verify the resource exists before importing (don't import non-existent resources)
|
||||
- For Kubernetes resources, import IDs are typically `namespace/name`
|
||||
- For AWS resources, import IDs vary by resource type (check provider docs)
|
||||
- Consider adding `-lock=false` if state locking causes issues during recovery
|
||||
|
||||
## See Also
|
||||
- Terraform state management documentation
|
||||
- Kubernetes provider import documentation
|
||||
|
|
@ -1,405 +0,0 @@
|
|||
---
|
||||
name: traefik-helm-configuration
|
||||
description: |
|
||||
Consolidated Traefik Helm chart configuration skill covering HTTP/3 (QUIC), UDP
|
||||
cross-namespace routing, and plugin download failures. Use when:
|
||||
(1) enabling HTTP/3 on Traefik or Alt-Svc header shows wrong port (e.g., 8443 instead of 443),
|
||||
(2) HTTP/3 is configured in Helm values but not working end-to-end,
|
||||
(3) Cloudflare-proxied domains need HTTP/3 enabled,
|
||||
(4) custom UDP entrypoints don't appear in the LoadBalancer Service,
|
||||
(5) IngressRouteUDP logs show "udp service is not in the parent resource namespace",
|
||||
(6) DNS or other UDP traffic through Traefik times out despite correct IngressRouteUDP config,
|
||||
(7) all Traefik routes suddenly return 404 after a restart or pod recreation,
|
||||
(8) Traefik logs show "Plugins are disabled because an error has occurred",
|
||||
(9) plugin download fails with "context deadline exceeded" for crowdsec-bouncer or rewrite-body.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-22
|
||||
---
|
||||
|
||||
# Traefik Helm Chart Configuration
|
||||
|
||||
Consolidated guide for three common Traefik Helm chart issues: HTTP/3 (QUIC) enablement,
|
||||
UDP cross-namespace routing, and plugin download failures causing global 404s.
|
||||
|
||||
---
|
||||
|
||||
## HTTP/3 (QUIC)
|
||||
|
||||
### Problem
|
||||
|
||||
You want to enable HTTP/3 (QUIC) on a Traefik ingress controller in Kubernetes so that
|
||||
clients can negotiate HTTP/3 connections via the `Alt-Svc` response header.
|
||||
|
||||
### Context / When to Use
|
||||
|
||||
- Enabling HTTP/3 for the first time on Traefik
|
||||
- Troubleshooting HTTP/3 not working despite configuration
|
||||
- Alt-Svc header shows internal container port (8443) instead of external port (443)
|
||||
- Need to enable HTTP/3 on both origin (Traefik) and CDN (Cloudflare)
|
||||
|
||||
### Solution
|
||||
|
||||
#### Step 1: Configure Traefik Helm Chart Values
|
||||
|
||||
In the Traefik Helm release values, add `http3` configuration to the `websecure` entrypoint:
|
||||
|
||||
```hcl
|
||||
# In modules/kubernetes/traefik/main.tf
|
||||
ports = {
|
||||
websecure = {
|
||||
port = 8443
|
||||
exposedPort = 443
|
||||
protocol = "TCP"
|
||||
http = {
|
||||
tls = {
|
||||
enabled = true
|
||||
}
|
||||
}
|
||||
# Enable HTTP/3 (QUIC)
|
||||
http3 = {
|
||||
enabled = true
|
||||
advertisedPort = 443 # CRITICAL: Must match the external port
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Key gotcha: `advertisedPort = 443`**
|
||||
|
||||
Without `advertisedPort`, Traefik advertises the *internal container port* (8443) in the
|
||||
`Alt-Svc` header:
|
||||
```
|
||||
Alt-Svc: h3=":8443"; ma=2592000
|
||||
```
|
||||
|
||||
This is wrong because clients connect on external port 443, not 8443. The correct header is:
|
||||
```
|
||||
Alt-Svc: h3=":443"; ma=2592000
|
||||
```
|
||||
|
||||
Setting `advertisedPort = 443` fixes this.
|
||||
|
||||
#### Step 2: Ensure Helm Chart Fully Re-renders
|
||||
|
||||
Changing `http3.enabled=true` in values alone may not cause the Helm chart to add the
|
||||
required UDP port to the Service and Deployment specs. The Traefik Helm chart templates
|
||||
need to re-render to include `websecure-http3: 443/UDP` in the Service.
|
||||
|
||||
If the Service doesn't show a UDP port after applying:
|
||||
- See the companion skill `helm-release-force-rerender` for fixing this
|
||||
- The root cause is that `helm upgrade --reuse-values` (Terraform's default behavior)
|
||||
may not trigger template re-rendering for structural changes like adding new ports
|
||||
|
||||
After a successful apply, verify the Service has the UDP port:
|
||||
```bash
|
||||
kubectl get svc traefik -n traefik -o yaml | grep -A5 "443"
|
||||
```
|
||||
|
||||
Expected output should include both:
|
||||
```yaml
|
||||
- name: websecure
|
||||
port: 443
|
||||
protocol: TCP
|
||||
targetPort: websecure
|
||||
- name: websecure-http3
|
||||
port: 443
|
||||
protocol: UDP
|
||||
targetPort: websecure-http3
|
||||
```
|
||||
|
||||
#### Step 3: Enable HTTP/3 on Cloudflare (if using Cloudflare proxy)
|
||||
|
||||
For Cloudflare-proxied domains, HTTP/3 must also be enabled at the Cloudflare zone level.
|
||||
|
||||
**Cloudflare Provider v4** (current in this repo):
|
||||
```hcl
|
||||
resource "cloudflare_zone_settings_override" "http3" {
|
||||
zone_id = var.cloudflare_zone_id
|
||||
|
||||
settings {
|
||||
http3 = "on" # String values: "on" or "off"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Note**: In Cloudflare provider v5, this uses `cloudflare_zone_setting` (singular) with
|
||||
different syntax. The v4 resource is `cloudflare_zone_settings_override` (plural + override).
|
||||
|
||||
#### Step 4: Verify End-to-End
|
||||
|
||||
##### Testing from macOS
|
||||
|
||||
macOS system curl does NOT support HTTP/3. Install curl with HTTP/3:
|
||||
```bash
|
||||
brew install curl
|
||||
```
|
||||
|
||||
Then use the Homebrew version explicitly:
|
||||
```bash
|
||||
# Test HTTP/3 negotiation (Alt-Svc header)
|
||||
/opt/homebrew/opt/curl/bin/curl -sI https://example.viktorbarzin.me 2>&1 | grep -i alt-svc
|
||||
# Expected: alt-svc: h3=":443"; ma=2592000
|
||||
|
||||
# Test actual HTTP/3 connection
|
||||
/opt/homebrew/opt/curl/bin/curl --http3-only -sI https://example.viktorbarzin.me
|
||||
# Expected: HTTP/3 200
|
||||
```
|
||||
|
||||
##### Testing from within the Cluster
|
||||
|
||||
```bash
|
||||
# Use a curl image with HTTP/3 support (amd64 only)
|
||||
kubectl run curl-h3 --rm -it --image=ymuski/curl-http3 --restart=Never -- \
|
||||
curl --http3-only -sI https://example.viktorbarzin.me
|
||||
|
||||
# Note: ymuski/curl-http3 is amd64-only; it will fail on arm64 nodes
|
||||
```
|
||||
|
||||
##### Checking Traefik Logs
|
||||
|
||||
```bash
|
||||
kubectl logs -n traefik -l app.kubernetes.io/name=traefik --tail=100 | grep -i quic
|
||||
```
|
||||
|
||||
### Verification Checklist
|
||||
|
||||
1. Traefik Service shows UDP port 443 (`websecure-http3`)
|
||||
2. `Alt-Svc` response header shows `h3=":443"` (not `h3=":8443"`)
|
||||
3. `/opt/homebrew/opt/curl/bin/curl --http3-only` successfully connects
|
||||
4. Cloudflare zone has HTTP/3 enabled (for proxied domains)
|
||||
|
||||
### Current Configuration (This Repo)
|
||||
|
||||
- **Traefik config**: `modules/kubernetes/traefik/main.tf` (lines 89-92)
|
||||
- **Cloudflare HTTP/3**: `modules/kubernetes/cloudflared/cloudflare.tf` (line 153)
|
||||
- **MetalLB IP**: 10.0.20.202 (Traefik LoadBalancer service)
|
||||
|
||||
### Notes
|
||||
|
||||
- HTTP/3 uses QUIC over UDP. Firewalls must allow UDP 443 inbound.
|
||||
- Traefik automatically handles TLS for HTTP/3 using the same certs as HTTPS.
|
||||
- The `Alt-Svc` header is sent on HTTP/2 responses to tell clients HTTP/3 is available.
|
||||
Clients then upgrade to HTTP/3 on subsequent requests.
|
||||
- For non-Cloudflare (direct DNS) domains, only the Traefik-side config is needed.
|
||||
- Cloudflare handles its own HTTP/3 negotiation with end users; the origin connection
|
||||
between Cloudflare and Traefik uses HTTP/1.1 or HTTP/2 (not HTTP/3).
|
||||
|
||||
---
|
||||
|
||||
## UDP Cross-Namespace Routing
|
||||
|
||||
### Problem
|
||||
|
||||
Adding a custom UDP entrypoint (e.g., DNS on port 53) to Traefik v3 via Helm chart values
|
||||
doesn't work out of the box. Traffic times out even though the Traefik pod listens on the
|
||||
port internally. Two separate issues compound:
|
||||
|
||||
1. The Helm chart defaults `expose` to `false` for custom entrypoints -- the port is never
|
||||
added to the LoadBalancer Service
|
||||
2. `allowCrossNamespace` defaults to `false` -- IngressRouteUDP in namespace A can't
|
||||
reference a Service in namespace B
|
||||
|
||||
### Context / Trigger Conditions
|
||||
|
||||
- Traefik Helm chart v39.0.0+ (Traefik v3.x)
|
||||
- Custom UDP entrypoint defined in `ports` values
|
||||
- `IngressRouteUDP` referencing a service in a different namespace
|
||||
- Symptoms:
|
||||
- `kubectl get svc traefik` doesn't show your custom UDP port
|
||||
- UDP traffic to the LoadBalancer IP times out
|
||||
- Traefik logs show: `"udp service <namespace>/<service> is not in the parent resource namespace <traefik-namespace>"`
|
||||
- `netstat -ulnp` inside Traefik pod confirms it IS listening on the port
|
||||
|
||||
### Solution
|
||||
|
||||
#### Fix 1: Expose the UDP port on the Service
|
||||
|
||||
In the Helm values, add `expose = { default = true }` to the entrypoint:
|
||||
|
||||
```hcl
|
||||
# Terraform HCL
|
||||
ports = {
|
||||
dns-udp = {
|
||||
port = 5353
|
||||
exposedPort = 53
|
||||
protocol = "UDP"
|
||||
expose = { default = true } # <-- Required for custom entrypoints
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
```yaml
|
||||
# Helm values YAML equivalent
|
||||
ports:
|
||||
dns-udp:
|
||||
port: 5353
|
||||
exposedPort: 53
|
||||
protocol: UDP
|
||||
expose:
|
||||
default: true
|
||||
```
|
||||
|
||||
Note: The built-in `web` and `websecure` entrypoints have `expose.default = true` by
|
||||
default, but custom entrypoints do NOT.
|
||||
|
||||
#### Fix 2: Enable cross-namespace CRD references
|
||||
|
||||
In the Helm values, add `allowCrossNamespace = true` to the kubernetesCRD provider:
|
||||
|
||||
```hcl
|
||||
# Terraform HCL
|
||||
providers = {
|
||||
kubernetesCRD = {
|
||||
enabled = true
|
||||
allowCrossNamespace = true # <-- Required for cross-namespace IngressRouteUDP
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
```yaml
|
||||
# Helm values YAML
|
||||
providers:
|
||||
kubernetesCRD:
|
||||
enabled: true
|
||||
allowCrossNamespace: true
|
||||
```
|
||||
|
||||
This is required whenever an `IngressRouteUDP` (or `IngressRouteTCP`, `IngressRoute`)
|
||||
references a Kubernetes Service in a different namespace.
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# 1. Verify the port appears in the Service
|
||||
kubectl get svc -n traefik traefik -o jsonpath='{.spec.ports[*].name}'
|
||||
# Should include your custom entrypoint name (e.g., "dns-udp")
|
||||
|
||||
# 2. Check Traefik logs for cross-namespace errors
|
||||
kubectl logs -n traefik -l app.kubernetes.io/name=traefik | grep "not in the parent resource namespace"
|
||||
# Should return nothing after the fix
|
||||
|
||||
# 3. Test the UDP service
|
||||
dig @<traefik-lb-ip> example.com
|
||||
```
|
||||
|
||||
### Example
|
||||
|
||||
DNS forwarding through Traefik to Technitium DNS:
|
||||
- IngressRouteUDP in `traefik` namespace routes `dns-udp` entrypoint to
|
||||
`technitium-dns:53` in `technitium` namespace
|
||||
- Without Fix 1: port 53 never exposed on LoadBalancer -- traffic can't reach Traefik
|
||||
- Without Fix 2: Traefik rejects the route -- logs error every ~60 seconds
|
||||
- With both fixes: DNS queries to LoadBalancer IP:53 -> Traefik -> Technitium
|
||||
|
||||
### Notes
|
||||
|
||||
1. **Debugging order matters**: Fix 1 (expose) must come first. Without the port on the
|
||||
Service, you can't even test if the routing works. Fix 2 (cross-namespace) errors only
|
||||
appear in Traefik logs, not as user-visible failures.
|
||||
2. **`allowCrossNamespace` is a security consideration**: It allows any IngressRoute CRD
|
||||
to reference services in any namespace. If this is too broad, consider using
|
||||
`TraefikService` middleware or moving the IngressRouteUDP to the target namespace.
|
||||
3. **Rolling update**: Changing `allowCrossNamespace` triggers a Traefik pod restart
|
||||
(new CLI args). Changing `expose` only updates the Service (no pod restart needed).
|
||||
4. **This applies to TCP too**: `IngressRouteTCP` with cross-namespace services needs the
|
||||
same `allowCrossNamespace` setting.
|
||||
|
||||
---
|
||||
|
||||
## Plugin Download Failure (Global 404)
|
||||
|
||||
### Problem
|
||||
|
||||
After a node maintenance operation (containerd restart, node drain/uncordon, etc.),
|
||||
all Traefik-managed routes return 404. Services, Ingresses, and Middlewares all exist
|
||||
and look correct, making this extremely confusing to debug.
|
||||
|
||||
### Context / Trigger Conditions
|
||||
|
||||
- ALL Traefik routes return 404 simultaneously (not just one service)
|
||||
- Traefik pods are Running and Ready
|
||||
- Ingress resources exist with correct annotations
|
||||
- Middlewares exist in the correct namespaces
|
||||
- TLS secrets exist
|
||||
- Traefik startup logs contain: `Plugins are disabled because an error has occurred`
|
||||
- Plugin download error: `unable to download plugin ... context deadline exceeded`
|
||||
- Happened after a node restart, containerd restart, or network disruption
|
||||
|
||||
### Root Cause
|
||||
|
||||
Traefik downloads plugins (crowdsec-bouncer, rewrite-body, etc.) from
|
||||
`plugins.traefik.io` on **every pod startup**. If the download fails (network
|
||||
unreachable, DNS not ready, timeout), Traefik **disables ALL plugins entirely**.
|
||||
|
||||
Since the `crowdsec` middleware is a plugin-based middleware referenced in virtually
|
||||
every Ingress annotation (`traefik-crowdsec@kubernetescrd`), Traefik treats the
|
||||
missing plugin middleware as a fatal routing error and returns 404 for every route
|
||||
that references it -- which is typically all of them.
|
||||
|
||||
### Solution
|
||||
|
||||
```bash
|
||||
# 1. Confirm the diagnosis - check Traefik startup logs
|
||||
kubectl logs -n traefik -l app.kubernetes.io/name=traefik | head -20
|
||||
# Look for: "Plugins are disabled because an error has occurred"
|
||||
|
||||
# 2. Verify outbound connectivity is restored
|
||||
kubectl exec -n traefik $(kubectl get pods -n traefik -l app.kubernetes.io/name=traefik \
|
||||
-o jsonpath='{.items[0].metadata.name}') -- wget -q -O- --timeout=5 https://plugins.traefik.io
|
||||
|
||||
# 3. Rollout restart to retry plugin download
|
||||
kubectl rollout restart deployment -n traefik traefik
|
||||
|
||||
# 4. Verify plugins loaded
|
||||
kubectl logs -n traefik -l app.kubernetes.io/name=traefik | grep "Plugins"
|
||||
# Should show: "Plugins loaded."
|
||||
|
||||
# 5. Verify routes work
|
||||
curl -s -o /dev/null -w "%{http_code}" -H "Host: viktorbarzin.me" https://10.0.20.202 -k
|
||||
# Should return 200 instead of 404
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
- Traefik logs show `Plugins loaded.` (not `Plugins are disabled`)
|
||||
- Routes return expected HTTP status codes (200, 302, etc.) instead of 404
|
||||
- `kubectl logs -n traefik <pod> | grep "does not exist"` shows no middleware errors
|
||||
|
||||
### Why This Is Hard to Debug
|
||||
|
||||
1. **Traefik pods show Running/Ready** -- health checks pass even without plugins
|
||||
2. **All Kubernetes resources look correct** -- Ingresses, Services, Middlewares all exist
|
||||
3. **The error is in startup logs only** -- not in per-request logs (requests just get 404)
|
||||
4. **The 404 is Traefik's default** -- same as "no route matched", not a backend error
|
||||
5. **The middleware error is logged once at startup** -- easy to miss in a stream of logs
|
||||
|
||||
### Prevention
|
||||
|
||||
- During planned maintenance (node drain, containerd restart), restart Traefik pods
|
||||
AFTER network connectivity is confirmed restored
|
||||
- Consider pre-caching Traefik plugins in the container image or using an init container
|
||||
- Monitor for the `Plugins are disabled` log message in your alerting system
|
||||
|
||||
### Notes
|
||||
|
||||
- This affects ALL plugin-based middlewares, not just crowdsec
|
||||
- The `rewrite-body` plugin (used for rybbit analytics injection) is also affected
|
||||
- Traefik v3.x downloads plugins on every startup; there is no persistent cache
|
||||
- If only some routes return 404, the problem is likely different (missing middleware
|
||||
or TLS secret, not a plugin issue)
|
||||
|
||||
---
|
||||
|
||||
## References
|
||||
|
||||
- [Traefik HTTP/3 Documentation](https://doc.traefik.io/traefik/routing/entrypoints/#http3)
|
||||
- [Traefik Helm Chart Values](https://github.com/traefik/traefik-helm-chart/blob/master/traefik/values.yaml)
|
||||
- [Cloudflare HTTP/3 Settings](https://developers.cloudflare.com/speed/optimization/protocol/http3/)
|
||||
- [Traefik Helm Chart Ports Configuration](https://github.com/traefik/traefik-helm-chart)
|
||||
- [Traefik v3 Providers Documentation](https://doc.traefik.io/traefik/providers/kubernetes-crd/)
|
||||
|
||||
## See Also
|
||||
|
||||
- `traefik-rewrite-body-troubleshooting` -- Traefik rewrite-body plugin troubleshooting (compression, Accept header issues)
|
||||
- `helm-release-force-rerender` -- Force Helm chart re-render when structural changes don't take effect
|
||||
|
|
@ -1,200 +0,0 @@
|
|||
---
|
||||
name: traefik-rewrite-body-troubleshooting
|
||||
description: |
|
||||
Troubleshooting guide for the Traefik rewrite-body plugin (packruler/rewrite-body).
|
||||
Covers two failure modes: (1) Compression failure — plugin logs "flate: corrupt input
|
||||
before offset 5" when backends send gzip-compressed responses, corrupting response
|
||||
bodies and breaking WebSocket connections, authentication flows, and mobile app
|
||||
connectivity. (2) Silent skip — plugin silently skips content injection (rybbit
|
||||
analytics, trap links, or any HTML rewriting) when the request Accept header doesn't
|
||||
contain "text/html" (e.g., curl's default Accept: */*), making it appear broken
|
||||
despite correct configuration.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-22
|
||||
---
|
||||
|
||||
# Traefik Rewrite-Body Plugin Troubleshooting
|
||||
|
||||
Two distinct failure modes for the `packruler/rewrite-body` Traefik plugin used for
|
||||
injecting analytics scripts (rybbit) and anti-AI trap links into HTML responses.
|
||||
|
||||
---
|
||||
|
||||
## Problem 1: Compression Failure
|
||||
|
||||
### Symptoms
|
||||
- Traefik logs show: `Rewrite-Body | ERROR ... Error loading content: flate: corrupt input before offset 5`
|
||||
- Mobile apps (e.g., Home Assistant Companion) fail while browser works
|
||||
- HA Companion app shows repeated `GET /?external_auth=1` requests (auth loop)
|
||||
- WebSocket connections (`/api/websocket`) are very short-lived (seconds instead of minutes)
|
||||
- HTTP 499 errors on API calls (client disconnects due to corrupted responses)
|
||||
- Using `packruler/rewrite-body` plugin v1.2.0 with `monitoring.types = ["text/html"]`
|
||||
|
||||
### Root Cause
|
||||
Despite the `monitoring.types = ["text/html"]` filter, the plugin attempts to decompress
|
||||
ALL responses before checking content type. When decompression fails on certain gzip
|
||||
encodings, it corrupts the response body, breaking:
|
||||
- WebSocket upgrade handshakes
|
||||
- Authentication flows (HA Companion app's `external_auth` callback)
|
||||
- Mobile app connectivity (while browser appears to work due to auto-reconnect)
|
||||
|
||||
### Misleading Symptoms
|
||||
- HTTP/3 (QUIC) may appear to be the cause because HTTP/3 requests show 499 errors.
|
||||
This is a red herring -- the rewrite-body plugin corruption affects all protocols.
|
||||
- WebSocket issues may look like a timeout or proxy configuration problem.
|
||||
- The `monitoring.types = ["text/html"]` config suggests the plugin should only touch
|
||||
HTML, but it still processes all responses for decompression before filtering.
|
||||
|
||||
### Solution
|
||||
|
||||
#### Step 1: Create a strip-accept-encoding middleware
|
||||
Add a Traefik middleware that removes `Accept-Encoding` from requests, forcing
|
||||
backends to send uncompressed responses that the plugin can safely process:
|
||||
|
||||
```hcl
|
||||
# In traefik/middleware.tf
|
||||
resource "kubernetes_manifest" "middleware_strip_accept_encoding" {
|
||||
manifest = {
|
||||
apiVersion = "traefik.io/v1alpha1"
|
||||
kind = "Middleware"
|
||||
metadata = {
|
||||
name = "strip-accept-encoding"
|
||||
namespace = kubernetes_namespace.traefik.metadata[0].name
|
||||
}
|
||||
spec = {
|
||||
headers = {
|
||||
customRequestHeaders = {
|
||||
"Accept-Encoding" = ""
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
depends_on = [helm_release.traefik]
|
||||
}
|
||||
```
|
||||
|
||||
#### Step 2: Add middleware to routes with rewrite-body
|
||||
In the ingress factory middleware chain, add `strip-accept-encoding` BEFORE the
|
||||
rewrite-body middleware:
|
||||
|
||||
```hcl
|
||||
var.rybbit_site_id != null ? "traefik-strip-accept-encoding@kubernetescrd" : null,
|
||||
var.rybbit_site_id != null ? "${var.namespace}-rybbit-analytics-${var.name}@kubernetescrd" : null,
|
||||
```
|
||||
|
||||
The order matters: strip-accept-encoding must come first so the request reaches
|
||||
the backend without Accept-Encoding, and the uncompressed response then passes
|
||||
through the rewrite-body plugin.
|
||||
|
||||
### Verification (Compression Fix)
|
||||
1. Check Traefik logs for absence of `flate: corrupt input` errors:
|
||||
```bash
|
||||
kubectl logs -n traefik -l app.kubernetes.io/name=traefik --tail=200 | grep -i "flate\|rewrite-body"
|
||||
```
|
||||
2. Verify the middleware chain includes strip-accept-encoding before rybbit:
|
||||
```bash
|
||||
kubectl get ingress -n <namespace> <name> -o jsonpath='{.metadata.annotations.traefik\.ingress\.kubernetes\.io/router\.middlewares}'
|
||||
```
|
||||
3. Test mobile app connectivity (HA Companion, etc.)
|
||||
|
||||
### Notes (Compression)
|
||||
- This affects ALL services using the rewrite-body plugin, not just HA
|
||||
- The fix is applied conditionally: `strip-accept-encoding` is only added to the
|
||||
middleware chain when `rybbit_site_id` is set, so services without analytics
|
||||
are unaffected
|
||||
- Both `ingress_factory` and `reverse_proxy/factory` modules need the fix
|
||||
- Traefik may still compress responses to clients via its own compression middleware;
|
||||
the strip only affects the backend request
|
||||
- The plugin's `monitoring.types` filter works for deciding what to rewrite, but
|
||||
decompression is attempted on all responses regardless
|
||||
|
||||
---
|
||||
|
||||
## Problem 2: Silent Skip (Accept Header Mismatch)
|
||||
|
||||
### Symptoms
|
||||
- rewrite-body middleware is in the ingress middleware chain and shows status "enabled" in Traefik API
|
||||
- `curl https://example.com/` returns original HTML with no injected content
|
||||
- Browser shows injected content (rybbit script, trap links, etc.)
|
||||
- No errors in Traefik logs -- the plugin silently skips processing
|
||||
- `monitoring.types = ["text/html"]` is configured in the middleware spec
|
||||
- Middleware chain order is correct (strip-accept-encoding before rewrite-body)
|
||||
|
||||
### Root Cause
|
||||
In the plugin source code, `SupportsProcessing()` checks the **request** `Accept`
|
||||
header (not the response `Content-Type`) against `monitoring.types`:
|
||||
|
||||
```go
|
||||
func (r *Rewriter) SupportsProcessing(req *http.Request) bool {
|
||||
accept := req.Header.Get("Accept")
|
||||
for _, monitoringType := range r.monitoring.Types {
|
||||
if strings.Contains(accept, monitoringType) {
|
||||
return true
|
||||
}
|
||||
}
|
||||
return false
|
||||
}
|
||||
```
|
||||
|
||||
It uses `strings.Contains(accept, "text/html")`. The curl default `Accept: */*` does
|
||||
NOT contain the substring `text/html`, so the plugin returns false and skips all
|
||||
processing. Browser requests include `Accept: text/html,application/xhtml+xml,...`
|
||||
which does match.
|
||||
|
||||
### Misleading Symptoms
|
||||
- Appears as if the middleware isn't working at all
|
||||
- May look like a middleware ordering issue or configuration error
|
||||
- `kubectl get middleware` shows the resource exists with correct spec
|
||||
- Traefik API (`/api/http/middlewares/`) shows the middleware as "enabled"
|
||||
- Checking the rewrite-body regex patterns seems pointless since nothing is being processed
|
||||
|
||||
### Solution
|
||||
This is **working as designed** -- not a bug. The fix depends on context:
|
||||
|
||||
#### For testing with curl
|
||||
Add the `Accept` header to simulate a browser:
|
||||
```bash
|
||||
curl -s -H "Accept: text/html,application/xhtml+xml" https://example.com/
|
||||
```
|
||||
|
||||
#### For verifying injection is working
|
||||
```bash
|
||||
# Check for injected content (trap links, analytics, etc.)
|
||||
curl -s -H "Accept: text/html,application/xhtml+xml" https://example.com/ \
|
||||
| grep -oE 'href="https://poison[^"]*"'
|
||||
|
||||
# Check for rybbit analytics
|
||||
curl -s -H "Accept: text/html,application/xhtml+xml" https://example.com/ \
|
||||
| grep -oE 'src="https://rybbit[^"]*"'
|
||||
```
|
||||
|
||||
#### For programmatic clients that need injection
|
||||
If a non-browser client needs to receive injected content, ensure it sends
|
||||
`Accept: text/html` in its request headers.
|
||||
|
||||
### Verification (Accept Header)
|
||||
```bash
|
||||
# Without Accept header -- no injection (expected)
|
||||
curl -s https://example.com/ | grep -c "rybbit"
|
||||
# Output: 0
|
||||
|
||||
# With Accept header -- injection works
|
||||
curl -s -H "Accept: text/html" https://example.com/ | grep -c "rybbit"
|
||||
# Output: 1
|
||||
```
|
||||
|
||||
### Notes (Accept Header)
|
||||
- This behavior is independent of the compression issue (Problem 1 above)
|
||||
- The check is on the **request** `Accept` header, not the **response** `Content-Type`
|
||||
- `Accept: */*` does NOT match -- `strings.Contains("*/*", "text/html")` is false
|
||||
- Real AI scrapers typically send browser-like Accept headers, so trap links will be
|
||||
injected for them correctly
|
||||
- API calls (which typically send `Accept: application/json`) are correctly skipped
|
||||
|
||||
---
|
||||
|
||||
## See Also
|
||||
- `traefik-helm-configuration` -- Traefik Helm chart configuration and entrypoints
|
||||
- `ingress-factory-migration` -- Covers the ingress factory module that creates
|
||||
rybbit analytics middlewares
|
||||
|
|
@ -1,454 +0,0 @@
|
|||
---
|
||||
name: cluster-health
|
||||
description: |
|
||||
Check Kubernetes cluster health and fix common issues. Use when:
|
||||
(1) User asks to check the cluster, check health, or "what's wrong",
|
||||
(2) User asks about pod status, node health, or deployment issues,
|
||||
(3) User asks to fix stuck pods, evicted pods, or CrashLoopBackOff,
|
||||
(4) User mentions "health check", "cluster status", "cluster health",
|
||||
(5) User asks "is everything running" or "any problems".
|
||||
Runs 47 cluster-wide checks (nodes, workloads, monitoring, certs,
|
||||
backups, external reachability, PVE host thermals + load, HA Sofia
|
||||
status dashboard, Immich smart-search, Proxmox CSI ghost-disk drift)
|
||||
with safe auto-fix for evicted pods.
|
||||
author: Claude Code
|
||||
version: 2.0.0
|
||||
date: 2026-04-19
|
||||
---
|
||||
|
||||
# Cluster Health Check
|
||||
|
||||
## MANDATORY: Run the script first
|
||||
|
||||
When this skill is invoked, your **first action** must be to run the
|
||||
cluster health check script and reason over its output before doing
|
||||
anything else. Do not improvise individual `kubectl` calls — the
|
||||
script is the authoritative surface.
|
||||
|
||||
```bash
|
||||
cd /home/wizard/code
|
||||
bash infra/scripts/cluster_healthcheck.sh --json | tee /tmp/cluster-health.json
|
||||
```
|
||||
|
||||
If the session is rooted elsewhere, fall back to the absolute path:
|
||||
|
||||
```bash
|
||||
bash /home/wizard/code/infra/scripts/cluster_healthcheck.sh --json
|
||||
```
|
||||
|
||||
Then:
|
||||
|
||||
1. Parse the JSON. Report the PASS/WARN/FAIL counts + overall verdict.
|
||||
2. Iterate every FAIL and WARN check, describe what tripped, and propose
|
||||
the remediation path (use the recipes below).
|
||||
3. Only reach for ad-hoc `kubectl` commands when investigating a
|
||||
specific failure beyond what the script reported.
|
||||
|
||||
Exit codes: `0` = healthy, `1` = warnings only, `2` = failures.
|
||||
|
||||
## Quick flags
|
||||
|
||||
```bash
|
||||
# Human-readable report (default), no auto-fix
|
||||
bash infra/scripts/cluster_healthcheck.sh
|
||||
|
||||
# Machine-readable JSON summary
|
||||
bash infra/scripts/cluster_healthcheck.sh --json
|
||||
|
||||
# Only show WARN + FAIL (suppress PASS noise)
|
||||
bash infra/scripts/cluster_healthcheck.sh --quiet
|
||||
|
||||
# Enable auto-fix (delete evicted pods, kick stuck CrashLoop pods)
|
||||
bash infra/scripts/cluster_healthcheck.sh --fix
|
||||
|
||||
# Combined: quiet JSON without auto-fix
|
||||
bash infra/scripts/cluster_healthcheck.sh --no-fix --quiet --json
|
||||
|
||||
# Custom kubeconfig
|
||||
bash infra/scripts/cluster_healthcheck.sh --kubeconfig /path/to/config
|
||||
```
|
||||
|
||||
## What It Checks (47 checks)
|
||||
|
||||
| # | Check | Notes |
|
||||
|---|-------|-------|
|
||||
| 1 | Node Status | NotReady nodes, version drift |
|
||||
| 2 | Node Resources | CPU/mem >80% (warn) / >90% (fail) |
|
||||
| 3 | Node Conditions | MemoryPressure / DiskPressure / PIDPressure |
|
||||
| 4 | Problematic Pods | CrashLoopBackOff / Error / ImagePullBackOff |
|
||||
| 5 | Evicted/Failed Pods | `status.phase=Failed` |
|
||||
| 6 | DaemonSets | desired == ready |
|
||||
| 7 | Deployments | ready == desired replicas |
|
||||
| 8 | PVC Status | all Bound |
|
||||
| 9 | HPA Health | targets not `<unknown>`, utilization <100% |
|
||||
| 10 | CronJob Failures | job conditions `Failed=True` in last 24h |
|
||||
| 11 | CrowdSec Agents | all pods Running |
|
||||
| 12 | Ingress Routes | every ingress has an LB IP + Traefik LB |
|
||||
| 13 | Prometheus Alerts | count of firing alerts |
|
||||
| 14 | Uptime Kuma Monitors | internal + external monitors up |
|
||||
| 15 | ResourceQuota Pressure | any quota >80% used |
|
||||
| 16 | StatefulSets | ready == desired |
|
||||
| 17 | Node Disk Usage | ephemeral-storage <80% |
|
||||
| 18 | Helm Release Health | all `deployed` (no `pending-*`) |
|
||||
| 19 | Kyverno Policy Engine | all pods Running |
|
||||
| 20 | NFS Connectivity | 192.168.1.127 showmount / port 2049 |
|
||||
| 21 | DNS Resolution | Technitium resolves internal + external |
|
||||
| 22 | TLS Certificate Expiry | TLS `Secret` certs >30d valid |
|
||||
| 23 | GPU Health | nvidia namespace + device-plugin Running |
|
||||
| 24 | Cloudflare Tunnel | pods Running |
|
||||
| 25 | Resource Usage | node CPU/mem headroom |
|
||||
| 26 | HA Sofia — Entity Availability | Home Assistant unavailable/unknown count |
|
||||
| 27 | HA Sofia — Integration Health | config entries setup_error / not_loaded |
|
||||
| 28 | HA Sofia — Automation Status | disabled / stale (>30d) automations |
|
||||
| 29 | HA Sofia — System Resources | HA CPU / mem / disk |
|
||||
| 30 | Hardware Exporters | snmp / idrac-redfish / proxmox / tuya pods + scrapes |
|
||||
| 31 | cert-manager — Certificate Readiness | Certificate CRs with `Ready!=True` |
|
||||
| 32 | cert-manager — Certificate Expiry (<14d) | notAfter within 14d |
|
||||
| 33 | cert-manager — Failed CertificateRequests | `Ready=False, reason=Failed` |
|
||||
| 34 | Backup Freshness — Per-DB Dumps | MySQL + PG dumps within 25h |
|
||||
| 35 | Backup Freshness — Offsite Sync | Pushgateway `backup_last_success_timestamp` <27h |
|
||||
| 36 | Backup Freshness — LVM PVC Snapshots | newest thin snapshot <25h (SSH PVE) |
|
||||
| 37 | Monitoring — Prometheus + Alertmanager | `/-/ready` + AM pods Running |
|
||||
| 38 | Monitoring — Vault Sealed Status | `vault status` reports `Sealed: false` |
|
||||
| 39 | Monitoring — ClusterSecretStore Ready | `vault-kv` + `vault-database` Ready |
|
||||
| 40 | External — Cloudflared + Authentik Replicas | deployments fully ready |
|
||||
| 41 | External — ExternalAccessDivergence Alert | alert not firing |
|
||||
| 42 | External — Traefik 5xx Rate (15m) | top-10 services emitting 5xx |
|
||||
| 43 | PVE Host Thermals | package + per-core temps via `/sys/class/hwmon` (SSH). Baseline 55-65 °C. PASS <65 °C, WARN 65-82 °C (a VM is burning too much CPU), FAIL ≥83 °C (TjMax) |
|
||||
| 44 | PVE Host Load | `/proc/loadavg` via SSH. PASS 5m <30, WARN 30-37, FAIL ≥38 of 44 threads |
|
||||
| 45 | HA Sofia — Status Dashboard | emo's curated Барзини → Статус view (`dashboard-barzini` / path `status`). Pulls the lovelace config via WS, batch-renders every `custom:mushroom-template-card` secondary template against `/api/template`, classifies each rendered line: FAIL on `Offline` / `Disconnected` / `Разкачен` / `— No data`; WARN on `⚠️` / `Abnormal` / `Trouble (` / `(ниска)` / `Пълен резервоар` / `Грешка` / `attention` / `Внимание`. Verdict rolls up across the 8 sections (Сигурност, Мрежа & IT, Енергия, Климат, Уреди, Мултимедия, Осветление, Поливна) |
|
||||
| 46 | Immich Smart Search | `clip_index` residency in PG `shared_buffers` + representative ANN probe latency (in immich-postgresql). FAIL >1.5s or <50% resident; WARN >0.5s or <90% resident. Cold cache → check `clip-index-prewarm` CronJob |
|
||||
| 47 | Proxmox CSI — Ghost-Disk Drift | Per node, compares real virtio-scsi CSI disks in `qm config <vmid>` (SSH PVE) vs attached proxmox-CSI VolumeAttachments k8s tracks. Catches orphaned "ghost" disks left by failed detaches (`query-pci` QMP timeouts) that the scheduler's 28-LUN guard can't see. PASS reconciled; WARN drift>0 or real 20-24; FAIL real ≥25 (near LUN cap → imminent wedge). Cleanup: detach ghosts via `qm set <vmid> --delete scsiN` (frees slot, retains LV) |
|
||||
|
||||
## Safe Auto-Fix Rules
|
||||
|
||||
`--fix` only performs operations that are genuinely reversible and
|
||||
observable. Nothing here rewrites Terraform state or mutates the cluster
|
||||
beyond "delete pod".
|
||||
|
||||
### Done automatically by `--fix`
|
||||
|
||||
- **Evicted / Failed pods** — delete them; the controller recreates.
|
||||
```bash
|
||||
kubectl delete pods -A --field-selector=status.phase=Failed
|
||||
```
|
||||
- **CrashLoopBackOff pods with >10 restarts** — delete once to reset
|
||||
backoff timer.
|
||||
|
||||
### NEVER auto-fix (requires human investigation)
|
||||
|
||||
- NotReady nodes
|
||||
- MemoryPressure / DiskPressure / PIDPressure
|
||||
- ImagePullBackOff (usually a bad tag / registry credential)
|
||||
- Deployment ready-replica mismatch
|
||||
- Pending PVCs
|
||||
- Node CPU/memory >90%
|
||||
- CronJob failures
|
||||
- DaemonSet desired != ready
|
||||
- Vault sealed
|
||||
- ClusterSecretStore not Ready
|
||||
- cert-manager Certificate failures
|
||||
- Backup freshness regressions
|
||||
- Any external-reachability failure
|
||||
|
||||
## Deep-investigation recipes per failure mode
|
||||
|
||||
### Node Issues (checks 1, 3, 17, 25)
|
||||
|
||||
```bash
|
||||
kubectl describe node <node>
|
||||
kubectl top nodes
|
||||
kubectl get events --field-selector involvedObject.name=<node> --sort-by='.lastTimestamp'
|
||||
# SSH to the node
|
||||
ssh root@10.0.20.10X
|
||||
systemctl status kubelet
|
||||
journalctl -u kubelet --since "30 minutes ago" | tail -100
|
||||
df -h ; free -h
|
||||
```
|
||||
|
||||
Node IPs: `10.0.20.100` master, `.101` node1 (GPU), `.102` node2,
|
||||
`.103` node3, `.104` node4.
|
||||
|
||||
### Pod Issues (checks 4, 5, 11, 19)
|
||||
|
||||
```bash
|
||||
kubectl describe pod -n <ns> <pod>
|
||||
kubectl logs -n <ns> <pod> --tail=200
|
||||
kubectl logs -n <ns> <pod> --previous --tail=200
|
||||
kubectl get events -n <ns> --sort-by='.lastTimestamp' | tail -20
|
||||
```
|
||||
|
||||
Common failure causes: OOMKilled (raise mem limit in Terraform), bad
|
||||
config / missing env var, DB connection failure (check `dbaas` pods),
|
||||
NFS mount failure (`showmount -e 192.168.1.127`), stale
|
||||
imagePullSecret.
|
||||
|
||||
### Deployment / StatefulSet / DaemonSet (checks 6, 7, 16)
|
||||
|
||||
```bash
|
||||
kubectl describe deployment -n <ns> <name>
|
||||
kubectl rollout status deployment -n <ns> <name>
|
||||
kubectl rollout history deployment -n <ns> <name>
|
||||
kubectl get rs -n <ns> -l app=<app>
|
||||
```
|
||||
|
||||
### PVC (check 8)
|
||||
|
||||
```bash
|
||||
kubectl describe pvc -n <ns> <pvc>
|
||||
kubectl get events -n <ns> --field-selector reason=FailedMount --sort-by='.lastTimestamp'
|
||||
kubectl get pv | grep <pvc>
|
||||
showmount -e 192.168.1.127
|
||||
```
|
||||
|
||||
### cert-manager (checks 31, 32, 33)
|
||||
|
||||
```bash
|
||||
kubectl get certificate -A
|
||||
kubectl describe certificate -n <ns> <name>
|
||||
kubectl get certificaterequest -A
|
||||
kubectl describe certificaterequest -n <ns> <name>
|
||||
kubectl logs -n cert-manager deploy/cert-manager | tail -50
|
||||
```
|
||||
|
||||
Common causes: ACME HTTP-01 challenge blocked, ClusterIssuer missing
|
||||
DNS provider secret, rate-limit from Let's Encrypt.
|
||||
|
||||
### Backups (checks 34, 35, 36)
|
||||
|
||||
```bash
|
||||
# Per-DB dumps (inside the DB pod)
|
||||
kubectl exec -n dbaas mysql-standalone-0 -- ls -lah /backup/per-db/
|
||||
kubectl exec -n dbaas pg-cluster-0 -- ls -lah /backup/per-db/
|
||||
|
||||
# Pushgateway metrics
|
||||
kubectl exec -n monitoring deploy/prometheus-server -- \
|
||||
wget -qO- http://prometheus-prometheus-pushgateway:9091/metrics | \
|
||||
grep backup_last_success_timestamp
|
||||
|
||||
# LVM snapshots on PVE host
|
||||
ssh -o BatchMode=yes root@192.168.1.127 \
|
||||
'lvs -o lv_name,lv_time,lv_size --noheadings | grep snap'
|
||||
```
|
||||
|
||||
If offsite sync is stale, the common cause is the
|
||||
`offsite-sync-backup.service` systemd unit on the PVE host failing.
|
||||
`ssh root@192.168.1.127 'systemctl status offsite-sync-backup'`.
|
||||
|
||||
### Monitoring stack (checks 37, 38, 39)
|
||||
|
||||
```bash
|
||||
# Prometheus
|
||||
kubectl exec -n monitoring deploy/prometheus-server -- wget -qO- http://localhost:9090/-/ready
|
||||
kubectl logs -n monitoring deploy/prometheus-server --tail=100
|
||||
|
||||
# Alertmanager
|
||||
kubectl get pods -n monitoring | grep alertmanager
|
||||
kubectl logs -n monitoring -l app=prometheus-alertmanager --tail=100
|
||||
|
||||
# Vault
|
||||
kubectl exec -n vault vault-0 -- sh -c 'VAULT_ADDR=http://127.0.0.1:8200 vault status'
|
||||
# If sealed: check raft peers with `vault operator raft list-peers` and unseal.
|
||||
|
||||
# ClusterSecretStore
|
||||
kubectl get clustersecretstore
|
||||
kubectl describe clustersecretstore vault-kv vault-database
|
||||
kubectl logs -n external-secrets deploy/external-secrets --tail=100
|
||||
```
|
||||
|
||||
### External reachability (checks 40, 41, 42)
|
||||
|
||||
```bash
|
||||
# Cloudflared
|
||||
kubectl get pods -n cloudflared
|
||||
kubectl logs -n cloudflared -l app=cloudflared --tail=100
|
||||
|
||||
# Authentik (Helm chart names the deployment goauthentik-server)
|
||||
kubectl get deployment -n authentik goauthentik-server
|
||||
kubectl logs -n authentik deploy/goauthentik-server --tail=100
|
||||
|
||||
# ExternalAccessDivergence alert
|
||||
kubectl exec -n monitoring deploy/prometheus-server -- \
|
||||
wget -qO- 'http://localhost:9090/api/v1/alerts' | \
|
||||
python3 -m json.tool | grep -A 5 ExternalAccessDivergence
|
||||
|
||||
# Traefik 5xx — find the hot service
|
||||
kubectl exec -n monitoring deploy/prometheus-server -- \
|
||||
wget -qO- 'http://localhost:9090/api/v1/query?query=topk(10,rate(traefik_service_requests_total{code=~%225..%22}%5B15m%5D))' \
|
||||
| python3 -m json.tool
|
||||
```
|
||||
|
||||
### OOMKilled remediation
|
||||
|
||||
1. `kubectl describe pod -n <ns> <pod> | grep -A 5 Limits`
|
||||
2. Edit `infra/modules/kubernetes/<service>/main.tf` and raise
|
||||
`resources.limits.memory`.
|
||||
3. `cd /home/wizard/code/infra && scripts/tg apply` (Tier 1) or
|
||||
`terraform apply -target=module.<service>` as appropriate.
|
||||
|
||||
### ImagePullBackOff remediation
|
||||
|
||||
1. `kubectl describe pod -n <ns> <pod> | grep -A 5 Events`
|
||||
2. Verify tag exists on the source registry.
|
||||
3. Check pull-through cache at `10.0.20.10:{5000,5010,5020,5030}`.
|
||||
4. Update the image tag in Terraform + re-apply.
|
||||
|
||||
### Persistent CrashLoopBackOff after auto-fix
|
||||
|
||||
1. `kubectl logs -n <ns> <pod> --previous --tail=200`
|
||||
2. `kubectl describe pod -n <ns> <pod>` and check Last State:
|
||||
- `OOMKilled` → raise memory limit
|
||||
- Exit code 137 → OOM or probe killed
|
||||
- Exit code 143 → SIGTERM / graceful shutdown failed
|
||||
3. Cross-check dbaas + NFS + secrets are healthy.
|
||||
|
||||
## Performance forensics — top consumers + optimization hints
|
||||
|
||||
When the cluster is healthy (script returns 0) but the host is hot or load
|
||||
is elevated, switch from "what broke?" to "what's expensive?". Run these
|
||||
in order; stop as soon as the root cause is obvious.
|
||||
|
||||
### Step 1 — Snapshot top consumers cluster-wide
|
||||
|
||||
```bash
|
||||
# Top 15 pods by current CPU
|
||||
kubectl top pods --all-namespaces --sort-by=cpu --no-headers | head -15
|
||||
|
||||
# Top 5 nodes by CPU + memory pressure
|
||||
kubectl top nodes
|
||||
|
||||
# Top 15 by 5-min rolling rate (smoothed — kills noise from one-off spikes)
|
||||
kubectl -n monitoring exec deploy/prometheus-server -- wget -qO- \
|
||||
"http://localhost:9090/api/v1/query?query=topk(15,sum%20by%20(namespace,pod)%20(rate(container_cpu_usage_seconds_total%7Bcontainer!%3D''%7D%5B5m%5D)))" \
|
||||
| python3 -m json.tool | head -80
|
||||
```
|
||||
|
||||
### Step 2 — For each suspect pod, get the WHY
|
||||
|
||||
For every pod in the top-N, gather these BEFORE proposing a fix:
|
||||
|
||||
```bash
|
||||
NS=<namespace>; POD=<pod>; CONT=$(kubectl -n $NS get pod $POD -o jsonpath='{.spec.containers[0].name}')
|
||||
|
||||
# What it does (image + command)
|
||||
kubectl -n $NS get pod $POD -o jsonpath='{.spec.containers[0].image}{"\n"}{.spec.containers[0].args}{"\n"}'
|
||||
|
||||
# Resource limits + current usage
|
||||
kubectl -n $NS top pod $POD --containers
|
||||
kubectl -n $NS get pod $POD -o jsonpath='{.spec.containers[0].resources}'
|
||||
|
||||
# Recent logs filtered for reconcile loops, watch storms, slow queries
|
||||
kubectl -n $NS logs $POD -c $CONT --tail=200 --since=5m 2>&1 \
|
||||
| grep -iE 'reconcil|watch|scrape|index|loop|retry|slow|timeout' | tail -20
|
||||
|
||||
# Restart count + recent OOM
|
||||
kubectl -n $NS describe pod $POD | grep -E 'Restart Count|Last State|Reason'
|
||||
|
||||
# Self-exported metrics (for apps that publish on /metrics)
|
||||
kubectl -n $NS exec $POD -c $CONT -- wget -qO- localhost:<port>/metrics 2>/dev/null | head -50
|
||||
```
|
||||
|
||||
### Step 3 — apiserver / etcd specific deep-dive (when control-plane is hot)
|
||||
|
||||
```bash
|
||||
# Top request producers by verb+resource (last 30 min)
|
||||
kubectl -n monitoring exec deploy/prometheus-server -- wget -qO- \
|
||||
"http://localhost:9090/api/v1/query?query=topk(15,sum%20by%20(resource,verb)%20(rate(apiserver_request_total%5B30m%5D)))" \
|
||||
| python3 -m json.tool
|
||||
|
||||
# Top user agents (which clients are hammering)
|
||||
kubectl -n monitoring exec deploy/prometheus-server -- wget -qO- \
|
||||
"http://localhost:9090/api/v1/query?query=topk(15,sum%20by%20(user_agent)%20(rate(apiserver_request_total%5B30m%5D)))" \
|
||||
| python3 -m json.tool
|
||||
|
||||
# Long-running requests (WATCH / CONNECT — log streams, pod-watchers)
|
||||
kubectl -n monitoring exec deploy/prometheus-server -- wget -qO- \
|
||||
"http://localhost:9090/api/v1/query?query=apiserver_longrunning_requests" \
|
||||
| python3 -m json.tool
|
||||
|
||||
# etcd write rate + DB size
|
||||
kubectl -n monitoring exec deploy/prometheus-server -- wget -qO- \
|
||||
"http://localhost:9090/api/v1/query?query=rate(etcd_disk_wal_fsync_duration_seconds_count%5B5m%5D)" \
|
||||
| python3 -m json.tool
|
||||
```
|
||||
|
||||
### Step 4 — PVE host specific deep-dive (when temp / load is high)
|
||||
|
||||
Checks 43 + 44 capture package temp + 5-min load avg with PASS/WARN/FAIL
|
||||
thresholds — that's the first stop. When those WARN or FAIL, the
|
||||
follow-up commands below trace which VM / process is the source:
|
||||
|
||||
```bash
|
||||
# Per-core temps (broader than the package summary in check 43)
|
||||
ssh root@192.168.1.127 'for f in /sys/class/hwmon/hwmon0/temp*_input; do
|
||||
base=${f%_input}; label=$(cat ${base}_label 2>/dev/null || echo "${base##*/}")
|
||||
val=$(cat "$f"); echo " $label: $((val/1000))°C"
|
||||
done'
|
||||
|
||||
# Per-VM CPU (each VM = one kvm process)
|
||||
ssh root@192.168.1.127 'top -bn1 -o %CPU | grep kvm | head -10'
|
||||
|
||||
# pvestatd anomaly check — bursts > 50% usually mean LV count > 1000
|
||||
ssh root@192.168.1.127 'lvs --noheadings 2>/dev/null | wc -l'
|
||||
|
||||
# Stale snapshots (any '_pre-*' that survived past their rollback window)
|
||||
ssh root@192.168.1.127 'lvs --noheadings -o lv_name 2>/dev/null | awk "/_pre-/" | head -20'
|
||||
```
|
||||
|
||||
### Step 5 — Optimization decision
|
||||
|
||||
For each consumer in the top-N, fill in a row:
|
||||
|
||||
| Pod / Process | CPU (m) | Why busy | Tunable | Est saving | Trade-off | Effort |
|
||||
|---|---|---|---|---|---|---|
|
||||
|
||||
Then rank by ROI (saving / effort) and surface the top 3-5. **Hold back the ones where saving < 50m unless effort is also < 5 min.**
|
||||
|
||||
### Common causes + tunables (catalogue)
|
||||
|
||||
| Symptom | Likely cause | Tunable |
|
||||
|---|---|---|
|
||||
| **`kube-apiserver` > 1 core sustained** | `CONNECT pods/log` streams from `alloy`/`promtail` using apiserver-tail; OR Kyverno PolicyReport churn (background+enforce mode); OR VPA fanout (309 VPAs cause ~7 req/s) | Switch alloy/promtail to `loki.source.file`; raise Kyverno `backgroundScanInterval`; reduce VPA count |
|
||||
| **`pvestatd` 70-100% bursts** | LV metadata scan over > 1000 LVs (typically stale `_pre-*` snapshots from ad-hoc node ops) | Delete stale snapshots; `/usr/local/bin/lvm-pvc-snapshot prune` |
|
||||
| **Frigate > 2 cores** | Birdseye `mode: continuous` (16% on frigate.output); LPR debug; debug logging; too many active cameras × detect.fps | `birdseye.mode: motion`; `lpr.debug_save_plates: false`; remove debug loggers |
|
||||
| **`vault-0` looping ERRORs every ~10s** | DB static-role not in connection's `allowed_roles` list (drift between role and connection) | Add role to `vault_database_secret_backend_connection.*.allowed_roles` in TF |
|
||||
| **Alloy DS > 100m/pod** | `loki.source.kubernetes` (apiserver-tail) instead of `loki.source.file` | Switch to file-tail (~5× drop per pod) |
|
||||
| **Prometheus default 1m scrape** | Chart default; new sample every minute | Raise `server.global.scrape_interval` to 2m; pin critical jobs (snmp-ups) to 30s; bump `for: 1m` alerts to `for: 3m` |
|
||||
| **`kube-controller-manager` periodic ERROR loop** | Aggregated APIService discovery fails (calico/metrics-server unreachable, OR stuck Terminating pod still in endpoints) | Force-delete stuck pod; verify APIService Available; check pod runc bug on k8s-master |
|
||||
| **etcd write > 1 MB/s** | PolicyReport thrash, too-frequent secret rotation, or audit log mode = RequestResponse | Trim Kyverno reports config; raise rotation_period; downgrade audit policy to Metadata for noisy resources |
|
||||
|
||||
### What NOT to touch
|
||||
|
||||
- **calico-node, etcd write rate, kube-controller-manager core work, pg-cluster replication** — structural cost, touching them risks correctness.
|
||||
- **Pods doing legitimate request-serving work** (web servers, databases under load) — optimize the workload, not the runtime.
|
||||
- **Anything where Goldilocks VPA upperBound is already close to current request** — no headroom to cut.
|
||||
|
||||
### Source-of-truth notes
|
||||
|
||||
- **All infra mutations go via Terraform** (`scripts/tg plan/apply`). The recipes above are diagnostic; the FIX lives in `infra/stacks/<name>/main.tf` or chart values.
|
||||
- **Pod-internal config files** (e.g., Frigate's `/config/config.yml` on a PVC) are not TF-managed — edit in-pod and document in `infra/docs/runbooks/`.
|
||||
- **PVE host-level state** (LVM snapshots, pvestatd) — SSH + manual ops; record in memory if the pattern recurs.
|
||||
|
||||
## Notes on the canonical / hardlink setup
|
||||
|
||||
The authoritative copy of this SKILL.md lives at
|
||||
`/home/wizard/code/.claude/skills/cluster-health/SKILL.md`. A hardlink
|
||||
at `/home/wizard/code/infra/.claude/skills/cluster-health/SKILL.md`
|
||||
points to the same inode so infra-rooted sessions also discover the
|
||||
skill.
|
||||
|
||||
To verify the hardlink is intact:
|
||||
|
||||
```bash
|
||||
stat -c '%i %n' \
|
||||
/home/wizard/code/.claude/skills/cluster-health/SKILL.md \
|
||||
/home/wizard/code/infra/.claude/skills/cluster-health/SKILL.md
|
||||
```
|
||||
|
||||
Both should print the same inode number. If they diverge (e.g. `git
|
||||
checkout` replaced the file rather than updating it), re-link:
|
||||
|
||||
```bash
|
||||
ln -f /home/wizard/code/.claude/skills/cluster-health/SKILL.md \
|
||||
/home/wizard/code/infra/.claude/skills/cluster-health/SKILL.md
|
||||
```
|
||||
|
|
@ -1,215 +0,0 @@
|
|||
---
|
||||
name: disk-wear
|
||||
description: |
|
||||
Analyze disk write patterns on the PVE host to assess wear and identify
|
||||
top writers by VM, k8s app, and PVC. Use when:
|
||||
(1) User asks about disk wear, disk writes, or storage health,
|
||||
(2) User says "what's wearing the disk", "disk analysis", "I/O analysis",
|
||||
(3) User wants to check write rates by VM, k8s namespace, or PVC,
|
||||
(4) Periodic quarterly disk health review.
|
||||
Combines PVE host I/O stats (SSH), Prometheus metrics (PromQL), and
|
||||
k8s PVC-to-pod mapping for a full breakdown.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-04-17
|
||||
---
|
||||
|
||||
# Disk Wear Analysis
|
||||
|
||||
## Infrastructure
|
||||
|
||||
| Resource | Address | Notes |
|
||||
|----------|---------|-------|
|
||||
| PVE host | `root@192.168.1.127` (SSH) | Dell R730, PERC H730 RAID |
|
||||
| Prometheus | `prometheus-server.monitoring.svc:80` | Query via alertmanager pod (wget) |
|
||||
| SSD | Slot 4, Samsung 850 EVO 1TB | Rated 150 TBW |
|
||||
| HDD sdc | RAID1 (2x 11.7TB SAS 7200RPM) | Main data disk, enterprise rated ~550 TB/yr |
|
||||
| HDD sda | 1.2TB SAS 10K RPM | Backup only |
|
||||
|
||||
## Step 1: Physical Disk Overview + SSD Health
|
||||
|
||||
```bash
|
||||
ssh root@192.168.1.127 'echo "=== UPTIME ===" && uptime && echo "" && \
|
||||
echo "=== PHYSICAL DISK CUMULATIVE (since boot) ===" && iostat -d -k sda sdb sdc 2>/dev/null && echo "" && \
|
||||
echo "=== SSD SMART (Samsung 850 EVO, slot 4) ===" && \
|
||||
smartctl -d sat+megaraid,4 -A /dev/sda 2>/dev/null | grep -iE "power_on|reallocat|written|wear|pending|uncorrect"'
|
||||
```
|
||||
|
||||
**Interpret SSD health:**
|
||||
- `Wear_Leveling_Count`: 100 = new, 0 = dead. Calculate `(100 - value)%` wear used.
|
||||
- `Total_LBAs_Written`: multiply by 512 bytes for total TB written. Compare against 150 TBW rating.
|
||||
- Estimate remaining life: `(150 TBW - current TBW) / annual write rate`.
|
||||
|
||||
## Step 2: Real-Time Snapshot (30 seconds)
|
||||
|
||||
SSH to PVE host and take two reads of block device stats 30 seconds apart. This gives instantaneous write rates independent of Prometheus scrape intervals.
|
||||
|
||||
```bash
|
||||
ssh root@192.168.1.127 'bash -s' << 'SCRIPT'
|
||||
echo "=== 30-SECOND SNAPSHOT ($(date)) ==="
|
||||
declare -A snap1
|
||||
for dm in /sys/block/dm-*; do
|
||||
name=$(basename $dm)
|
||||
snap1[$name]=$(cat $dm/stat 2>/dev/null | awk '{print $7}')
|
||||
done
|
||||
for d in sda sdb sdc; do
|
||||
snap1[$d]=$(cat /sys/block/$d/stat 2>/dev/null | awk '{print $7}')
|
||||
done
|
||||
|
||||
sleep 30
|
||||
|
||||
printf "%-12s %10s %10s %s\n" "DEVICE" "kB/s" "GB/day" "NAME"
|
||||
echo "-------------------------------------------------------------------"
|
||||
results=""
|
||||
for dm in /sys/block/dm-*; do
|
||||
name=$(basename $dm)
|
||||
s2=$(cat $dm/stat 2>/dev/null | awk '{print $7}')
|
||||
s1=${snap1[$name]:-0}
|
||||
diff=$((s2 - s1))
|
||||
if [ "$diff" -gt 100 ]; then
|
||||
kbps=$((diff / 2 / 30))
|
||||
gbday=$(echo "scale=1; $kbps * 86400 / 1048576" | bc)
|
||||
lvm=$(dmsetup info --columns --noheadings -o name /dev/$name 2>/dev/null)
|
||||
results="$results\n$name $kbps $gbday $lvm"
|
||||
fi
|
||||
done
|
||||
for d in sda sdb sdc; do
|
||||
s2=$(cat /sys/block/$d/stat 2>/dev/null | awk '{print $7}')
|
||||
s1=${snap1[$d]:-0}
|
||||
diff=$((s2 - s1))
|
||||
kbps=$((diff / 2 / 30))
|
||||
gbday=$(echo "scale=1; $kbps * 86400 / 1048576" | bc)
|
||||
results="$results\n$d $kbps $gbday (physical)"
|
||||
done
|
||||
echo -e "$results" | sort -k2 -rn | head -30 | while read dev kbps gbday name; do
|
||||
printf "%-12s %8s kB/s %8s GB/day %s\n" "$dev" "$kbps" "$gbday" "$name"
|
||||
done
|
||||
SCRIPT
|
||||
```
|
||||
|
||||
## Step 3: Prometheus — Per-App Write Attribution
|
||||
|
||||
Query Prometheus from inside the cluster (alertmanager pod has wget).
|
||||
|
||||
### 3a. Top PVC Writers (1h rate)
|
||||
|
||||
```bash
|
||||
kubectl exec -n monitoring prometheus-alertmanager-0 -- wget -qO- 'http://prometheus-server/api/v1/query' \
|
||||
--post-data='query=topk(20,rate(node_disk_written_bytes_total{instance=~"pve.*"}[1h])*on(device)group_left(lv_name,vg_name)node_disk_device_mapper_info{instance=~"pve.*",lv_name=~"vm-9999-pvc-.*"})' \
|
||||
2>/dev/null | python3 -c "
|
||||
import json,sys
|
||||
d=json.load(sys.stdin)
|
||||
for r in d['data']['result']:
|
||||
m = r['metric']
|
||||
val = float(r['value'][1])
|
||||
gb_day = val * 86400 / 1073741824
|
||||
if gb_day > 0.05:
|
||||
lv = m.get('lv_name','?').replace('vm-9999-','')
|
||||
print(f'{gb_day:8.1f} GB/day {lv}')
|
||||
"
|
||||
```
|
||||
|
||||
Then enrich PVC UUIDs with names:
|
||||
```bash
|
||||
kubectl get pv -o custom-columns=NAME:.metadata.name,PVC:.spec.claimRef.name,NS:.spec.claimRef.namespace | grep "pvc-<UUID>"
|
||||
```
|
||||
|
||||
### 3b. Top VM Writers (1h rate)
|
||||
|
||||
```bash
|
||||
kubectl exec -n monitoring prometheus-alertmanager-0 -- wget -qO- 'http://prometheus-server/api/v1/query' \
|
||||
--post-data='query=topk(10,rate(node_disk_written_bytes_total{instance=~"pve.*"}[1h])*on(device)group_left(lv_name,vg_name)node_disk_device_mapper_info{instance=~"pve.*",lv_name!~"vm-9999-.*|root|swap|data.*|nfs.*|backup.*|ssd.*"})' \
|
||||
2>/dev/null | python3 -c "
|
||||
import json,sys
|
||||
d=json.load(sys.stdin)
|
||||
for r in d['data']['result']:
|
||||
m = r['metric']
|
||||
val = float(r['value'][1])
|
||||
gb_day = val * 86400 / 1073741824
|
||||
print(f'{gb_day:8.1f} GB/day {m.get(\"lv_name\",\"?\")}')
|
||||
"
|
||||
```
|
||||
|
||||
Enrich VM IDs with names:
|
||||
```bash
|
||||
ssh root@192.168.1.127 'qm list' 2>/dev/null
|
||||
```
|
||||
|
||||
### 3c. Aggregate PVC Writes by K8s Namespace
|
||||
|
||||
After collecting the top PVC writers from 3a, map each PVC UUID to its namespace using `kubectl get pv`, then sum by namespace. Present as a table:
|
||||
|
||||
| Namespace | GB/day | Top PVC |
|
||||
|-----------|--------|---------|
|
||||
| dbaas | ... | mysql-standalone, pg-cluster |
|
||||
| monitoring | ... | prometheus-data |
|
||||
|
||||
### 3d. Historical Trend (7-day total)
|
||||
|
||||
```bash
|
||||
kubectl exec -n monitoring prometheus-alertmanager-0 -- wget -qO- 'http://prometheus-server/api/v1/query' \
|
||||
--post-data='query=topk(10,increase(node_disk_written_bytes_total{instance=~"pve.*",device=~"sda|sdb|sdc"}[7d]))' \
|
||||
2>/dev/null | python3 -c "
|
||||
import json,sys
|
||||
d=json.load(sys.stdin)
|
||||
for r in d['data']['result']:
|
||||
m = r['metric']
|
||||
val = float(r['value'][1])
|
||||
tb = val / 1099511627776
|
||||
print(f'{tb:8.2f} TB/7d device={m.get(\"device\",\"?\")}')
|
||||
"
|
||||
```
|
||||
|
||||
## Step 4: Interpretation
|
||||
|
||||
### Baselines
|
||||
|
||||
| Metric | Healthy | Warning | Critical |
|
||||
|--------|---------|---------|----------|
|
||||
| sdc (HDD RAID1) annualized | <200 TB/yr | 200-400 TB/yr | >400 TB/yr |
|
||||
| sdb (SSD) wear used | <50% | 50-80% | >80% |
|
||||
| Single PVC write rate | <20 GB/day | 20-50 GB/day | >50 GB/day |
|
||||
| Single VM write rate | <50 GB/day | 50-100 GB/day | >100 GB/day |
|
||||
| NFS volume total | <20 GB/day | 20-50 GB/day | >50 GB/day |
|
||||
|
||||
### Known Write Sources (expected baseline, April 2026)
|
||||
|
||||
| Source | Expected GB/day | Notes |
|
||||
|--------|----------------|-------|
|
||||
| MySQL standalone | 5-10 | uptimekuma heartbeats + phpipam. `skip-log-bin`, no GR |
|
||||
| PostgreSQL cluster | 5-15 | Technitium DNS query logs (90-day retention) + app DBs |
|
||||
| k8s-master etcd | 30-50 | etcd WAL + snapshot compaction |
|
||||
| k8s-node VMs | 10-30 each | containerd layers, kubelet journals, ephemeral storage |
|
||||
| Prometheus | 3-5 | TSDB compaction |
|
||||
| home-assistant | 10-15 | Recorder database (SQLite/MariaDB) |
|
||||
| NFS volume | 5-10 | Minimal after TrueNAS deprecation |
|
||||
|
||||
### Red Flags (investigate immediately)
|
||||
|
||||
- Any single PVC >50 GB/day
|
||||
- MySQL `log_bin` = ON (should be OFF — `skip-log-bin` in standalone config)
|
||||
- Technitium MySQL or SQLite query log plugins re-installed (should be uninstalled)
|
||||
- NFS writes >30 GB/day (media ingestion or backup churn)
|
||||
- SSD wear >80% or projected life <2 years
|
||||
- k8s node VM writes >100 GB/day (something writing heavily to ephemeral storage)
|
||||
|
||||
## Step 5: Report Format
|
||||
|
||||
Present findings as three tables:
|
||||
|
||||
**1. Physical Disks**
|
||||
| Disk | Type | 7d Total | Rate GB/day | Annualized | Status |
|
||||
|------|------|----------|-------------|------------|--------|
|
||||
|
||||
**2. Top Writers (VMs + PVCs combined, sorted by rate)**
|
||||
| Rank | Name | Type | GB/day | Status | Notes |
|
||||
|------|------|------|--------|--------|-------|
|
||||
|
||||
**3. By K8s Namespace**
|
||||
| Namespace | PVC Writes GB/day | Top Contributor |
|
||||
|-----------|-------------------|-----------------|
|
||||
|
||||
End with:
|
||||
- Annualized wear projections
|
||||
- Comparison with previous run (if user provides one)
|
||||
- Action items for any WARNING/CRITICAL findings
|
||||
|
|
@ -1,90 +0,0 @@
|
|||
---
|
||||
name: extend-vm-storage
|
||||
description: |
|
||||
Extend disk storage on a Kubernetes node VM (Proxmox-hosted).
|
||||
Use when: (1) User wants to increase disk space on a k8s node VM,
|
||||
(2) A node is running low on disk, (3) User says "extend storage"
|
||||
or "add disk space". Automates: drain → shutdown → resize → boot →
|
||||
expand filesystem → uncordon.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2025-01-01
|
||||
---
|
||||
|
||||
# Extend VM Storage Skill
|
||||
|
||||
**Purpose**: Extend disk storage on a Kubernetes node VM (Proxmox-hosted).
|
||||
|
||||
**When to use**: User wants to increase disk space on a k8s node VM, or a node is running low on disk.
|
||||
|
||||
## Workflow
|
||||
|
||||
### 1. Identify the Node
|
||||
|
||||
Ask the user which node needs more storage and how much to add.
|
||||
|
||||
Valid nodes: `k8s-master`, `k8s-node1`, `k8s-node2`, `k8s-node3`, `k8s-node4`
|
||||
|
||||
### 2. Run the Script
|
||||
|
||||
```bash
|
||||
./scripts/extend_vm_storage.sh <node-name> <size-increment>
|
||||
```
|
||||
|
||||
**Example**:
|
||||
```bash
|
||||
./scripts/extend_vm_storage.sh k8s-node2 +64G
|
||||
```
|
||||
|
||||
### 3. What the Script Does
|
||||
|
||||
1. Validates inputs (node name and size format)
|
||||
2. Resolves node IP via kubectl
|
||||
3. Prompts for confirmation
|
||||
4. Drains the node (evicts pods)
|
||||
5. Shuts down the VM in Proxmox
|
||||
6. Resizes the disk (`scsi0`) by the given increment
|
||||
7. Starts the VM and waits for SSH
|
||||
8. Expands the filesystem inside the guest (auto-detects LVM vs direct partition)
|
||||
9. Uncordons the node
|
||||
10. Shows verification output (`df -h` and node status)
|
||||
|
||||
### 4. Update Terraform (if needed)
|
||||
|
||||
If you want Terraform to reflect the new disk size, update the VM definition in `main.tf` or `modules/create-vm/` so that a future `terraform apply` doesn't revert the change. Check if the VM disk size is managed by Terraform:
|
||||
|
||||
```bash
|
||||
grep -A5 "disk" main.tf | grep -i size
|
||||
```
|
||||
|
||||
If managed, update the size value to match the new total.
|
||||
|
||||
### 5. Verification
|
||||
|
||||
After the script completes, verify:
|
||||
```bash
|
||||
kubectl --kubeconfig $(pwd)/config get nodes
|
||||
ssh wizard@<node-ip> "df -h /"
|
||||
```
|
||||
|
||||
## Recovery
|
||||
|
||||
If the script fails mid-way:
|
||||
1. Check VM status: `ssh root@192.168.1.127 "qm status <vmid>"`
|
||||
2. Start VM if stopped: `ssh root@192.168.1.127 "qm start <vmid>"`
|
||||
3. Uncordon node: `kubectl --kubeconfig $(pwd)/config uncordon <node-name>`
|
||||
|
||||
## Constants
|
||||
|
||||
| Setting | Value |
|
||||
|---------|-------|
|
||||
| Proxmox host | `root@192.168.1.127` |
|
||||
| VM SSH user | `wizard` |
|
||||
| Disk name | `scsi0` |
|
||||
| Shutdown timeout | 300s |
|
||||
| SSH wait timeout | 300s |
|
||||
|
||||
## Questions to Ask User
|
||||
|
||||
1. Which node needs more storage?
|
||||
2. How much storage to add? (e.g., +64G)
|
||||
|
|
@ -1,487 +0,0 @@
|
|||
---
|
||||
name: home-assistant
|
||||
description: |
|
||||
Control Home Assistant smart home devices and automations. Use when:
|
||||
(1) User asks to turn on/off lights, switches, or devices,
|
||||
(2) User asks about the state of sensors, devices, or entities,
|
||||
(3) User says "turn on the lights", "set temperature", "lock the door",
|
||||
(4) User asks to run a scene or script,
|
||||
(5) User asks "what devices are on?" or "is the door locked?",
|
||||
(6) User mentions smart home, IoT, or home automation.
|
||||
There are TWO Home Assistant deployments: ha-london (default) and ha-sofia.
|
||||
Always use Home Assistant for smart home control.
|
||||
author: Claude Code
|
||||
version: 2.0.0
|
||||
date: 2026-02-07
|
||||
---
|
||||
|
||||
# Home Assistant Control
|
||||
|
||||
## Problem
|
||||
Need to control smart home devices, check sensor states, or run automations via Home Assistant.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- User asks to control lights, switches, covers, climate, etc.
|
||||
- User asks about device states ("is the light on?", "what's the temperature?")
|
||||
- User wants to run a scene or script
|
||||
- User mentions turning things on/off
|
||||
- User asks about smart home devices
|
||||
|
||||
## Deployments
|
||||
|
||||
There are **two** Home Assistant instances:
|
||||
|
||||
| Instance | URL | SSH | Default? |
|
||||
|----------|-----|-----|----------|
|
||||
| **ha-london** | `https://ha-london.viktorbarzin.me` | `ssh hassio@192.168.8.103` | Yes |
|
||||
| **ha-sofia** | `https://ha-sofia.viktorbarzin.me` | `ssh vbarzin@192.168.1.8` | No |
|
||||
|
||||
- **Default**: ha-london (use unless user specifies "sofia" or "ha-sofia")
|
||||
- **Aliases**: "ha" or "HA" = ha-london. "ha sofia" or "ha-sofia" = ha-sofia.
|
||||
|
||||
## Prerequisites
|
||||
- Python 3 with `requests` package available (installed via PYTHONPATH or system packages)
|
||||
- Environment variables for each instance:
|
||||
- **ha-london**: `HOME_ASSISTANT_URL` and `HOME_ASSISTANT_TOKEN`
|
||||
- **ha-sofia**: `HOME_ASSISTANT_SOFIA_URL` and `HOME_ASSISTANT_SOFIA_TOKEN`
|
||||
|
||||
## API Control
|
||||
|
||||
### Scripts
|
||||
|
||||
| Instance | Script |
|
||||
|----------|--------|
|
||||
| ha-london | `.claude/home-assistant.py` |
|
||||
| ha-sofia | `.claude/home-assistant-sofia.py` |
|
||||
|
||||
### Execution Pattern (CRITICAL)
|
||||
Run the scripts directly with python3 (env vars are set in the environment):
|
||||
|
||||
```bash
|
||||
# ha-london (default)
|
||||
python3 .claude/home-assistant.py [command] [options]
|
||||
|
||||
# ha-sofia
|
||||
python3 .claude/home-assistant-sofia.py [command] [options]
|
||||
```
|
||||
|
||||
### Available Commands
|
||||
|
||||
#### List Entities
|
||||
```bash
|
||||
# List all entities
|
||||
python .claude/home-assistant.py list
|
||||
|
||||
# List by domain
|
||||
python .claude/home-assistant.py list --domain light
|
||||
python .claude/home-assistant.py list --domain switch
|
||||
python .claude/home-assistant.py list --domain sensor
|
||||
python .claude/home-assistant.py list --domain climate
|
||||
python .claude/home-assistant.py list --domain cover
|
||||
|
||||
# JSON output
|
||||
python .claude/home-assistant.py list --json
|
||||
```
|
||||
|
||||
#### Search Entities
|
||||
```bash
|
||||
# Search by name or ID
|
||||
python .claude/home-assistant.py search "living room"
|
||||
python .claude/home-assistant.py search "temperature"
|
||||
python .claude/home-assistant.py search "door"
|
||||
```
|
||||
|
||||
#### Get Entity State
|
||||
```bash
|
||||
python .claude/home-assistant.py state light.living_room
|
||||
python .claude/home-assistant.py state sensor.temperature
|
||||
python .claude/home-assistant.py state --json light.living_room
|
||||
```
|
||||
|
||||
#### Control Entities
|
||||
```bash
|
||||
# Turn on/off
|
||||
python .claude/home-assistant.py on light.living_room
|
||||
python .claude/home-assistant.py off switch.tv
|
||||
python .claude/home-assistant.py toggle light.bedroom
|
||||
|
||||
# Set values
|
||||
python .claude/home-assistant.py set light.living_room 75 # brightness %
|
||||
python .claude/home-assistant.py set climate.thermostat 22 # temperature
|
||||
python .claude/home-assistant.py set cover.blinds 50 # position %
|
||||
python .claude/home-assistant.py set input_number.volume 80 # numeric value
|
||||
python .claude/home-assistant.py set input_boolean.away_mode on # boolean
|
||||
python .claude/home-assistant.py set input_select.mode "Night" # select option
|
||||
```
|
||||
|
||||
#### Run Scenes and Scripts
|
||||
```bash
|
||||
# Activate a scene
|
||||
python .claude/home-assistant.py scene movie_night
|
||||
python .claude/home-assistant.py scene scene.good_morning
|
||||
|
||||
# Run a script
|
||||
python .claude/home-assistant.py script bedtime_routine
|
||||
python .claude/home-assistant.py script script.welcome_home
|
||||
```
|
||||
|
||||
#### Call Any Service
|
||||
```bash
|
||||
# Generic service call
|
||||
python .claude/home-assistant.py service light turn_on --entity light.kitchen --data '{"brightness": 255}'
|
||||
python .claude/home-assistant.py service climate set_hvac_mode --entity climate.living_room --data '{"hvac_mode": "heat"}'
|
||||
python .claude/home-assistant.py service media_player play_media --entity media_player.tv --data '{"media_content_id": "...", "media_content_type": "video"}'
|
||||
```
|
||||
|
||||
#### List Services
|
||||
```bash
|
||||
# List all available services
|
||||
python .claude/home-assistant.py services
|
||||
|
||||
# Filter by domain
|
||||
python .claude/home-assistant.py services --domain light
|
||||
python .claude/home-assistant.py services --domain climate
|
||||
```
|
||||
|
||||
#### Send Notifications
|
||||
```bash
|
||||
python .claude/home-assistant.py notify "Door left open!"
|
||||
python .claude/home-assistant.py notify "Motion detected" --title "Security Alert"
|
||||
python .claude/home-assistant.py notify "Hello" --target notify.mobile_app
|
||||
```
|
||||
|
||||
## SSH Access (ha-sofia only)
|
||||
|
||||
ha-sofia supports SSH for direct configuration management.
|
||||
|
||||
### Connection
|
||||
```bash
|
||||
ssh vbarzin@192.168.1.8
|
||||
```
|
||||
|
||||
### Configuration Path
|
||||
```
|
||||
/config/
|
||||
```
|
||||
|
||||
### Common SSH Tasks
|
||||
```bash
|
||||
# Read configuration
|
||||
ssh vbarzin@192.168.1.8 "cat /config/configuration.yaml"
|
||||
|
||||
# Check HA logs (note: live log is inside HA Core container, not always accessible)
|
||||
ssh vbarzin@192.168.1.8 "tail -50 /config/home-assistant.log.1"
|
||||
|
||||
# List config files
|
||||
ssh vbarzin@192.168.1.8 "ls /config/*.yaml"
|
||||
|
||||
# Read automations/scenes/scripts
|
||||
ssh vbarzin@192.168.1.8 "cat /config/automations.yaml"
|
||||
ssh vbarzin@192.168.1.8 "cat /config/scenes.yaml"
|
||||
ssh vbarzin@192.168.1.8 "cat /config/scripts.yaml"
|
||||
|
||||
# Check secrets (keys only, not values)
|
||||
ssh vbarzin@192.168.1.8 "cat /config/secrets.yaml"
|
||||
```
|
||||
|
||||
### SSH Limitations
|
||||
- The SSH add-on runs in a separate container — `ha core logs` returns 401
|
||||
- Docker socket is not accessible — can't use `docker logs`
|
||||
- Live `home-assistant.log` may not be visible (written inside HA Core container)
|
||||
- Rotated logs (`.log.1`, `.log.old`) are accessible
|
||||
|
||||
## Complete Example
|
||||
|
||||
To turn on the living room light on ha-london:
|
||||
```bash
|
||||
python3 .claude/home-assistant.py on light.living_room
|
||||
```
|
||||
|
||||
To check ha-sofia configuration:
|
||||
```bash
|
||||
ssh vbarzin@ha-sofia.viktorbarzin.lan "cat /config/configuration.yaml"
|
||||
```
|
||||
|
||||
## Common Entity Domains
|
||||
|
||||
| Domain | Description | Common Actions |
|
||||
|--------|-------------|----------------|
|
||||
| `light` | Lights | on, off, toggle, set brightness |
|
||||
| `switch` | Switches | on, off, toggle |
|
||||
| `sensor` | Sensors | state (read-only) |
|
||||
| `binary_sensor` | Binary sensors | state (read-only) |
|
||||
| `climate` | Thermostats | set temperature, set mode |
|
||||
| `cover` | Blinds/covers | open, close, set position |
|
||||
| `lock` | Locks | lock, unlock |
|
||||
| `media_player` | Media devices | play, pause, volume |
|
||||
| `input_boolean` | Helper toggles | on, off |
|
||||
| `input_number` | Helper numbers | set value |
|
||||
| `input_select` | Helper dropdowns | select option |
|
||||
| `script` | Scripts | run |
|
||||
| `scene` | Scenes | activate |
|
||||
| `automation` | Automations | trigger, on, off |
|
||||
|
||||
## Verification
|
||||
- Commands print confirmation message on success
|
||||
- Use `state` command to verify entity changed
|
||||
- Exit code 0 = success, 1 = error
|
||||
|
||||
## Common Errors
|
||||
|
||||
| Error | Cause | Fix |
|
||||
|-------|-------|-----|
|
||||
| `HOME_ASSISTANT_URL and HOME_ASSISTANT_TOKEN must be set` | Env vars not set | Ensure `HOME_ASSISTANT_URL` and `HOME_ASSISTANT_TOKEN` are in the environment |
|
||||
| `404 Not Found` | Entity doesn't exist | Use `search` command to find correct entity ID |
|
||||
| `401 Unauthorized` | Token invalid/expired | Generate new long-lived token in HA |
|
||||
| `Connection refused` | HA not reachable | Check URL and network connectivity |
|
||||
|
||||
## Notes
|
||||
|
||||
1. **Entity IDs are case-sensitive** - use `search` to find exact IDs
|
||||
2. **Token must have sufficient permissions** - ensure token has access to all entities
|
||||
3. **Some entities require specific data** - use `services` command to see required fields
|
||||
4. **Two instances**: ha-london (default, K8s), ha-sofia (SSH + API)
|
||||
5. **ha-sofia SSH**: Uses default SSH key, user `vbarzin`, resolve DNS via `192.168.1.2`. Only reachable from local Sofia network (not remotely).
|
||||
|
||||
---
|
||||
|
||||
## ha-sofia Knowledge Map
|
||||
|
||||
### Overview
|
||||
- **1,087 entities** across 29 domains, **128 devices**, **13 areas**, **43 automations**
|
||||
- **Location**: Sofia, Bulgaria (Вермонт / Vermont neighborhood)
|
||||
- **4 tracked people**: Viktor Barzin, Emil Barzin, Valia Barzina, MQTT
|
||||
|
||||
### Key Systems
|
||||
|
||||
#### 1. Heating & Gas Boiler (EMS-ESP)
|
||||
- Buderus/Bosch gas boiler via EMS-ESP integration
|
||||
- Entities: `sensor.boiler_*`, `number.boiler_*`, `switch.boiler_*`
|
||||
- DHW (hot water), heating curves, burner stats, gas metering
|
||||
- Outside temp: `sensor.boiler_outside_temperature`
|
||||
|
||||
#### 2. Climate / Thermostats (4 rooms + bathroom)
|
||||
| Room | Entity | Bulgarian |
|
||||
|------|--------|-----------|
|
||||
| Children's room | `climate.thermostat_children_room` | Детска |
|
||||
| Office | `climate.thermostat_office_room` | Кабинет |
|
||||
| Living room | `climate.thermostat_living_room` | Хол |
|
||||
| Master bedroom | `climate.thermostat_master_bedroom` | род. Спалня |
|
||||
| Bathroom (Valchedram) | `climate.bania_vlchedrm` | Баня Вълчедръм |
|
||||
|
||||
#### 3. Solar / Photovoltaic (Solarman)
|
||||
- Inverter: `sensor.fv_b_*` (FV = фотоволтаици)
|
||||
- Battery, grid/self-use EMS mode, solar forecast
|
||||
- Energy totals tracked per grid/inverter
|
||||
|
||||
#### 4. ATS (Automatic Transfer Switch)
|
||||
- Grid ↔ inverter switching: `sensor.ats_*`
|
||||
- Load power, grid/inverter voltage, energy totals
|
||||
|
||||
#### 5. Security / Alarm (Paradox EVOHD+)
|
||||
- 3 alarm partitions: Apartment, Garage, Valchedram
|
||||
- PIR zones, door contacts, tamper sensors, PGMs for garage doors/doorbells
|
||||
|
||||
#### 6. Cameras / NVR / Frigate
|
||||
- Hikvision NVR (DS-7632NXI) with 9 cameras
|
||||
- Frigate NVR with object detection:
|
||||
- **Vermont** (home): cameras 10, 15, 16 — car/plate recognition
|
||||
- **Valchedram** (country): cameras 1, 2 — person detection
|
||||
- Object tracking: vehicles (Emo Skoda), cats (Мичка)
|
||||
|
||||
#### 7. Smart Appliances (Home Connect / Bosch-Siemens)
|
||||
| Appliance | Entity prefix | Bulgarian |
|
||||
|-----------|--------------|-----------|
|
||||
| Dishwasher | `*.miialna_mashina_*` | Миялна машина |
|
||||
| Washing machine | `*.peralnia_*` | Пералня (with i-Dos) |
|
||||
| Dryer | `*.sushilnia_*` | Сушилня |
|
||||
|
||||
#### 8. LED Strip Controllers (6-channel each)
|
||||
- Kitchen upper/lower: `light.kukhnia_*_socket_1-6`
|
||||
- Children's wardrobe: `light.led_detska_garderob_socket_1-6`
|
||||
- Hall wardrobe: `light.led_garderob_khol_socket_1-6`
|
||||
- Corridor wardrobe: `light.led_garderob_koridor_socket_1-6` (offline)
|
||||
- Master bedroom wardrobe: `light.led_garderob_rod_spalnia_socket_1-6` (offline)
|
||||
|
||||
#### 9. Media
|
||||
- Sony BRAVIA XR-65A80L (AirPlay + DLNA)
|
||||
- Marantz ND8006 (AirPlay + DLNA)
|
||||
|
||||
#### 10. Networking
|
||||
- TP-Link Archer AX6000 (main router)
|
||||
- TP-Link Archer MR200 (LTE backup)
|
||||
|
||||
#### 11. UPS
|
||||
- `sensor.ups_*` — battery, load, voltage, remaining time
|
||||
|
||||
#### 12. Ventilation (Pax BLE)
|
||||
- `sensor.ventilator_mokro_2_*` — bathroom fan with humidity/light sensors
|
||||
|
||||
#### 13. Synology NAS
|
||||
- **NAS_Barzini**: CPU 2%, Memory 26%, 2 drives (39C/41C)
|
||||
- Volume 1: 87.2% used (5.02 TB), status "attention"
|
||||
- DSM update available
|
||||
|
||||
#### 14. Printer
|
||||
- **HP ColorLaserJet M253-M254**: Black 49%, Cyan 88%, Magenta 91%, Yellow 90%
|
||||
|
||||
#### 15. Dell R730 Server (via iDRAC)
|
||||
- CPU temp 57C, Power 192W, Inlet 24C, Exhaust 29C
|
||||
- Tesla T4 GPU: 41C, 4% util, 4183MB VRAM, 32W
|
||||
|
||||
#### 16. Other Devices
|
||||
- **Dehumidifier** (Tuya): `humidifier.arete_*`
|
||||
- **Robot vacuum** (Rumi): `vacuum.rumi` — docked, 100% battery, 227 missions
|
||||
- **Tuya lights**: `light.krushka_*` (4 bulbs, currently offline)
|
||||
- **AC unit** (MELCloud): `climate.klimatik` — off, 23C
|
||||
- **Mistral AI**: Conversation integration (Devstral 2)
|
||||
|
||||
### Integrations
|
||||
HACS, ESPHome, Frigate, Home Connect, Paradox (PAI), Solarman, Pax BLE, Hikvision, InfluxDB, Mosquitto MQTT, Node-RED, Music Assistant, Zigbee2MQTT, Spook, Xtend Tuya, MELCloud, Synology DSM, HP Printer (IPP)
|
||||
|
||||
### Add-ons
|
||||
Advanced SSH, File Editor, Studio Code Server, InfluxDB, Mosquitto, Node-RED, Frigate, PAI, Music Assistant, ESPHome, Ookla Speedtest, HA USB/IP Client, **Home Assistant Version Control**
|
||||
|
||||
### Version Control (Git Config Tracking)
|
||||
- **Add-on**: Home Assistant Version Control v1.2.0 (slug: `4ab554b2_home-assistant-version-control`)
|
||||
- **Add-on repo**: `https://github.com/saihgupr/ha-addons`
|
||||
- **What it does**: Auto-tracks every config file change via git. File watcher (inotify) detects changes, debounces (5s default), commits automatically.
|
||||
- **Tracked files**: `.yaml`, `.yml`, `.json`, `.conf`, `.sh`, `.py` + `.storage/` (lovelace dashboards, entity/device registries, config entries)
|
||||
- **Excluded**: `secrets.yaml`, database files (`.db`), logs, `__pycache__`, binary files
|
||||
- **Git repo**: `/homeassistant/.git` (owned by root; SSH user needs `git config --global --add safe.directory /homeassistant`)
|
||||
- **GitHub remote**: `https://github.com/ViktorBarzin/ha-sofia-config` (private). Auth token from Vault `secret/viktor` key `github_pat`. Cloud sync pushes hourly.
|
||||
- **Web UI**: Sidebar → "Version Control", or Settings → Add-ons → HA Version Control → Open Web UI. Ingress URL: `/api/hassio_ingress/PYR_EdVzPtzZdRnGjrhI3qbGogCVJ18FrtOg6oaBf-w/`
|
||||
- **Features**: Browse commit history with diffs, restore individual files or full config to any point, delete recovery, smart reloads after restore
|
||||
- **API**: `POST /api/git/add-all-and-commit` (manual backup), `GET /api/git/history` (commit log), `POST /api/restore-file` (restore single file), `POST /api/restore-commit` (full rollback)
|
||||
- **SSH git access**: `ssh vbarzin@192.168.1.8 'git -C /homeassistant log --oneline -10'`
|
||||
|
||||
### Music Assistant (MASS)
|
||||
- **Addon slug**: `d5369777_music_assistant`
|
||||
- **Version**: 2.7.8
|
||||
- **Web UI**: `http://192.168.1.8:8095`
|
||||
- **Container name**: `addon_d5369777_music_assistant`
|
||||
- **Providers**: Spotify (OAuth PKCE + librespot), TuneIn Radio, RadioBrowser, BBC Sounds, Radio Paradise, Filesystem (remote share)
|
||||
- **Player providers**: UPnP/DLNA, AirPlay, Sendspin (port 8927)
|
||||
- **Registered players**: Marantz ND8006 (DLNA + AirPlay), Sony BRAVIA XR-65A80L (AirPlay), Web (Chrome)
|
||||
- **Librespot cache**: `/data/.cache/spotify--5s3mSP8y/credentials.json` (inside addon container)
|
||||
- **Troubleshooting**: See skill `music-assistant-librespot-wrong-account` for Spotify playback failures
|
||||
- **SSH addon access to container**: `sudo curl -s --unix-socket /run/docker.sock http://localhost/containers/<id>/exec` (requires sudo)
|
||||
|
||||
### Zones
|
||||
- **Вермонт** (Vermont) — Home
|
||||
- **Вълчедръм** (Valchedram) — Country house
|
||||
|
||||
### Bulgarian ↔ English Room Names
|
||||
| Bulgarian | English | Entity prefix |
|
||||
|-----------|---------|---------------|
|
||||
| Детска | Children's room | `detska` |
|
||||
| Кабинет | Office | `kabinet` |
|
||||
| Хол | Living room | `khol` |
|
||||
| Спалня / род. Спалня | Master bedroom | `rod_spalnia` |
|
||||
| Кухня | Kitchen | `kukhnia` |
|
||||
| Коридор | Corridor | `koridor` |
|
||||
| Баня | Bathroom | `bania` |
|
||||
| Гараж | Garage | `garaj` |
|
||||
| Мазе | Basement | `maze` |
|
||||
|
||||
---
|
||||
|
||||
## ha-london Knowledge Map
|
||||
|
||||
### Overview
|
||||
- **HA Version**: 2025.9.1 (Docker container on Raspberry Pi)
|
||||
- **Location**: London, UK
|
||||
- **Platform**: Raspberry Pi 4, HA OS (not Docker standalone)
|
||||
- **SSH**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
|
||||
- **Config path**: `/config/` (requires `sudo` for file access)
|
||||
- **3 tracked people**: Viktor Barzin, Anca Milea, Gheorghe Milea
|
||||
- **Zone**: London (home)
|
||||
|
||||
### Key Systems
|
||||
|
||||
#### 1. Smart Plugs (TP-Link Kasa) — Energy Monitoring
|
||||
Named plugs with power/energy tracking:
|
||||
|
||||
| Name | Entity | Usage/month | Purpose |
|
||||
|------|--------|-------------|---------|
|
||||
| Thor | `switch.thor` | 6.4 kWh | Server/NAS |
|
||||
| Pikkachu | `switch.pikkachu` | 4.8 kWh | Water cooler |
|
||||
| Michelle | `switch.emeter_plug` | 0.3 kWh | — |
|
||||
| Livia | `switch.livia` | 0.07 kWh | — |
|
||||
| Jinx | `switch.jinx` | 0.02 kWh | — |
|
||||
| Projector plug | `switch.tapo_p100` | unavailable | Tapo P100 |
|
||||
|
||||
#### 2. Air Quality (Apollo AIR-1 via ESPHome)
|
||||
- `sensor.apollo_air_1_fa2d34_co2`: CO2 level
|
||||
- `sensor.apollo_air_1_fa2d34_sen55_temperature`: Temperature
|
||||
- `sensor.apollo_air_1_fa2d34_sen55_humidity`: Humidity
|
||||
- PM1.0/2.5/4.0/10 particulate sensors
|
||||
- VOC, NOx, ammonia, CO, ethanol, hydrogen, methane, NO2 gas sensors
|
||||
|
||||
#### 3. Cowboy E-Bike
|
||||
- `sensor.bike_state_of_charge`: Battery %
|
||||
- `sensor.bike_total_distance`: Total km
|
||||
- `sensor.bike_total_co2_saved`: CO2 saved (grams)
|
||||
|
||||
#### 4. Uptime Monitoring (UptimeRobot)
|
||||
- `sensor.blog`: blog uptime
|
||||
- `sensor.valchedrym`: Valchedram site uptime
|
||||
- `switch.blog`, `switch.valchedrym`: monitoring toggles
|
||||
|
||||
#### 5. Oral-B Toothbrush (BLE)
|
||||
- `sensor.smart_series_6000_83d3_*`: mode, pressure, sector, time
|
||||
|
||||
#### 6. Network Device Tracking (~100 devices)
|
||||
- Router-based MAC tracking (many unnamed)
|
||||
- Named: Viktor's iPhone15Pro, Anca's iPhone13Pro, Apple Watch, Amazon Fire, iRobot, Portal, Living-Room TV
|
||||
|
||||
#### 7. Media & Entertainment
|
||||
- Projector + debug bridge: unavailable (Tapo plug off)
|
||||
- Scripts: `script.start_netflix`, `script.start_stremio`
|
||||
- Scene: `scene.night` (turns off Livia + Michelle plugs)
|
||||
|
||||
### Custom Components
|
||||
- **cowboy**: Cowboy e-bike integration (HACS)
|
||||
- **hildebrandglow_dcc**: UK smart meter DCC energy data (HACS)
|
||||
|
||||
### Integrations
|
||||
ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BLE, Ookla Speedtest, HACS, OpenRouter (multiple free LLMs), Piper (local TTS), Whisper (local STT), Android TV/ADB
|
||||
|
||||
### AI / Voice Assistants
|
||||
- 5 free LLM conversation agents: Google Gemma 3 27B, Meta Llama 3.2 3B, Mistral Devstral 2, OpenAI GPT-OSS-20B, Z.AI GLM 4.5 Air
|
||||
- Local voice: Piper (TTS) + Whisper (STT)
|
||||
- Google Translate TTS
|
||||
|
||||
### Automations (10)
|
||||
- Water cooler on/off scheduling (07:00 on, 00:30 off)
|
||||
- Michelle plug auto-off when idle (<70W)
|
||||
- Apollo AIR-1 RGB LED: CO2 indicator (on in morning, off at 22:00)
|
||||
- Cowboy e-bike low battery notification (ntfy + iPhone push)
|
||||
- Anca arrival/departure notifications
|
||||
- Night scene: turns off Livia + Michelle
|
||||
|
||||
### Docker Setup
|
||||
```bash
|
||||
docker run -d --name homeassistant --privileged \
|
||||
-e TZ=Europe/London \
|
||||
-v /home/pi/docker/homeAssistant:/config \
|
||||
-v /run/dbus:/run/dbus:ro \
|
||||
--network=host --restart=unless-stopped \
|
||||
homeassistant/home-assistant:2025.9
|
||||
```
|
||||
|
||||
### SSH Access
|
||||
```bash
|
||||
# Read config
|
||||
ssh hassio@192.168.8.103 "sudo cat /config/configuration.yaml"
|
||||
|
||||
# Check logs
|
||||
ssh hassio@192.168.8.103 "sudo docker logs homeassistant --tail 50"
|
||||
|
||||
# Restart HA via API (preferred)
|
||||
curl -s -X POST "http://192.168.8.103:8123/api/services/homeassistant/restart" \
|
||||
-H "Authorization: Bearer ${HOME_ASSISTANT_LONDON_TOKEN}"
|
||||
|
||||
# View Docker logs
|
||||
ssh hassio@192.168.8.103 "sudo docker logs homeassistant --tail 50"
|
||||
```
|
||||
|
|
@ -1,151 +0,0 @@
|
|||
---
|
||||
name: k8s-ndots-search-domain-nxdomain-flood
|
||||
description: |
|
||||
Fix for massive NxDomain query floods to external DNS servers caused by Kubernetes
|
||||
ndots:5 search domain expansion. Use when: (1) DNS server shows low cache hit rate
|
||||
with 60%+ NxDomain responses, (2) DNS logs show queries like
|
||||
"service.namespace.svc.cluster.local.yourdomain.lan", (3) external DNS receives
|
||||
thousands of junk queries per hour for non-existent names ending in your search
|
||||
domain, (4) DNS cache hit ratio is unexpectedly low despite stable workloads.
|
||||
Applies to any Kubernetes cluster using CoreDNS with a custom DNS search domain.
|
||||
author: Claude Code
|
||||
version: 1.1.0
|
||||
date: 2026-02-17
|
||||
---
|
||||
|
||||
# Kubernetes ndots:5 Search Domain NxDomain Flood
|
||||
|
||||
## Problem
|
||||
Kubernetes pods have `ndots:5` and a custom search domain (e.g., `viktorbarzin.lan`)
|
||||
in their `/etc/resolv.conf`. When resolving internal service names like
|
||||
`redis.redis.svc.cluster.local` (4 dots < ndots:5), glibc tries all search domain
|
||||
suffixes before the absolute name. This generates queries like:
|
||||
|
||||
1. `redis.redis.svc.cluster.local.namespace.svc.cluster.local` (CoreDNS handles, NxDomain)
|
||||
2. `redis.redis.svc.cluster.local.svc.cluster.local` (CoreDNS handles, NxDomain)
|
||||
3. `redis.redis.svc.cluster.local.cluster.local` (CoreDNS handles, NxDomain)
|
||||
4. `redis.redis.svc.cluster.local.yourdomain.lan` (CoreDNS **forwards to external DNS**, NxDomain)
|
||||
5. `redis.redis.svc.cluster.local` (finally resolves)
|
||||
|
||||
Step 4 is the problem: CoreDNS forwards `*.yourdomain.lan` queries to the external DNS
|
||||
server, flooding it with junk NxDomain requests. With hundreds of pods making DNS lookups,
|
||||
this generates tens of thousands of useless queries per day.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- DNS server (e.g., Technitium, Pi-hole, BIND) shows high NxDomain percentage (50%+)
|
||||
- DNS cache hit rate is unexpectedly low
|
||||
- DNS logs show queries ending in `*.svc.cluster.local.yourdomain.lan`
|
||||
- CoreDNS Corefile has a server block forwarding `yourdomain.lan` to an external DNS
|
||||
- Node resolv.conf has `search yourdomain.lan` (set by DHCP)
|
||||
- Top DNS clients by query volume are Kubernetes node IPs (not pod IPs), because
|
||||
CoreDNS forwards via NodePort and the source IP becomes the node IP
|
||||
|
||||
## Solution
|
||||
|
||||
### Step 1: Confirm the problem
|
||||
Check DNS query logs for the pattern:
|
||||
```bash
|
||||
# Enable Technitium query logging temporarily
|
||||
# API: /api/settings/set?token=TOKEN&enableLogging=true&logQueries=true&loggingType=File
|
||||
|
||||
# Check for junk queries
|
||||
kubectl exec -n technitium PODNAME -- grep "cluster.local.yourdomain" /etc/dns/logs/*.log
|
||||
```
|
||||
|
||||
### Step 2: Add generic CoreDNS template regex (RECOMMENDED)
|
||||
|
||||
Instead of creating specific catch-all blocks for each junk suffix pattern, add a single
|
||||
`template` directive with a regex inside the `yourdomain.lan` server block. This catches
|
||||
ALL multi-label junk queries (e.g., `*.cluster.local.yourdomain.lan`,
|
||||
`*.yourdomain.lan.yourdomain.lan`, `www.cloudflare.com.yourdomain.lan`) in one rule:
|
||||
|
||||
```
|
||||
yourdomain.lan:53 {
|
||||
errors
|
||||
template ANY ANY yourdomain.lan {
|
||||
match ".*\..*\.yourdomain\.lan\.$"
|
||||
rcode NXDOMAIN
|
||||
fallthrough
|
||||
}
|
||||
forward . <your-dns-server-ip>
|
||||
cache {
|
||||
success 10000 300 6
|
||||
denial 10000 300 60
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**How it works**: The regex `.*\..*\.yourdomain\.lan\.$` matches any query with 2+ labels
|
||||
before `.yourdomain.lan` — meaning only single-label queries like `idrac.yourdomain.lan`
|
||||
fall through to the real DNS server. All junk multi-label queries get instant NXDOMAIN.
|
||||
|
||||
**Important**: The `fallthrough` directive is required so that legitimate single-label
|
||||
queries (which don't match the regex) continue to the `forward` plugin.
|
||||
|
||||
#### Alternative: Specific catch-all blocks (DEPRECATED)
|
||||
|
||||
The older approach used separate server blocks per junk suffix pattern:
|
||||
|
||||
```
|
||||
cluster.local.yourdomain.lan:53 {
|
||||
errors
|
||||
template ANY ANY {
|
||||
rcode NXDOMAIN
|
||||
}
|
||||
cache {
|
||||
denial 10000 3600
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
This requires adding a new block for each pattern and doesn't catch arbitrary junk queries
|
||||
like `www.cloudflare.com.yourdomain.lan`. The generic regex approach above is preferred.
|
||||
|
||||
### Step 3: Apply the CoreDNS ConfigMap
|
||||
```bash
|
||||
kubectl apply -f coredns-configmap.yaml
|
||||
# CoreDNS auto-reloads via the `reload` plugin (default 30s)
|
||||
```
|
||||
|
||||
### Step 4: Manage in Terraform (this cluster)
|
||||
The CoreDNS ConfigMap is managed in `modules/kubernetes/technitium/main.tf` as
|
||||
`kubernetes_config_map.coredns`. To import an existing ConfigMap:
|
||||
```bash
|
||||
terraform import 'module.kubernetes_cluster.module.technitium["technitium"].kubernetes_config_map.coredns' 'kube-system/coredns'
|
||||
```
|
||||
|
||||
## Verification
|
||||
1. Test that the template returns NXDOMAIN instantly:
|
||||
```bash
|
||||
kubectl run dns-test --rm -i --restart=Never --image=busybox -- \
|
||||
nslookup redis.redis.svc.cluster.local.yourdomain.lan 10.96.0.10
|
||||
# Should return NXDOMAIN immediately
|
||||
```
|
||||
|
||||
2. Check DNS logs - no more `*.cluster.local.yourdomain.lan` queries to external DNS
|
||||
3. NxDomain percentage on external DNS should drop significantly within an hour
|
||||
|
||||
## Additional Fix: Enable DNS Cache Persistence
|
||||
If the DNS server (Technitium) loses its cache on pod restart, enable `saveCache`:
|
||||
```
|
||||
/api/settings/set?token=TOKEN&saveCache=true
|
||||
```
|
||||
This prevents the cache hit rate from resetting to zero after every restart.
|
||||
|
||||
## Notes
|
||||
- The same `ndots:5` issue also causes `*.yourdomain.lan.yourdomain.lan` (double suffix)
|
||||
and `*.yourdomain.me.yourdomain.lan` patterns — the generic regex catches all of these
|
||||
- The top DNS client IPs will be the **node IPs** (not pod IPs) because CoreDNS forwards
|
||||
via NodePort, and the source becomes the node's IP
|
||||
- `ndots:5` is the Kubernetes default and shouldn't be changed cluster-wide as it breaks
|
||||
short-name service resolution
|
||||
- Individual pods can set `dnsConfig.options: [{name: ndots, value: "2"}]` to reduce
|
||||
search domain lookups, but this is a per-pod opt-in
|
||||
- Prometheus scrape targets using `.yourdomain.lan` hostnames should add a trailing dot
|
||||
(e.g., `idrac.yourdomain.lan.:161`) to bypass ndots expansion entirely
|
||||
- ExternalName services don't need trailing dots — the generic template regex handles them
|
||||
|
||||
## See also
|
||||
- `pfsense-dnsmasq-interface-binding` — Related: preserve client IPs for DNS port forwarding
|
||||
- `crowdsec-agent-registration-failure` — another common K8s DNS-adjacent issue
|
||||
- `loki-helm-deployment-pitfalls` — Loki deployment patterns
|
||||
|
|
@ -1,194 +0,0 @@
|
|||
---
|
||||
name: pfsense
|
||||
description: |
|
||||
Manage the pfSense firewall at 10.0.20.1 via SSH. Use when:
|
||||
(1) User asks about firewall rules, NAT, port forwarding,
|
||||
(2) User asks about network diagnostics (ARP, routing, DNS, ping),
|
||||
(3) User asks about DHCP leases or static mappings,
|
||||
(4) User asks about VPN status (WireGuard, Tailscale),
|
||||
(5) User asks about pfSense services (Snort, FRR/BGP/OSPF, etc.),
|
||||
(6) User asks about firewall states, connections, or traffic,
|
||||
(7) User mentions "pfsense", "firewall", "gateway", or network troubleshooting,
|
||||
(8) User wants to check system health (CPU, memory, disk, temp) of pfSense.
|
||||
pfSense CE 2.7.2 on FreeBSD 14.0, VMID 101 on Proxmox.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-14
|
||||
---
|
||||
|
||||
# pfSense Firewall Management
|
||||
|
||||
## Overview
|
||||
- **Host**: `10.0.20.1` (Kubernetes VLAN gateway)
|
||||
- **SSH**: `ssh admin@10.0.20.1`
|
||||
- **Version**: pfSense CE 2.7.2, FreeBSD 14.0
|
||||
- **Proxmox VMID**: 101 (8 CPU, 16GB RAM, 32G disk)
|
||||
- **Web UI**: `https://pfsense.viktorbarzin.me` (via reverse proxy) or `https://10.0.20.1`
|
||||
- **Installed packages**: FRR (BGP/OSPF), Tailscale, Snort, WireGuard, REST API, FreeRADIUS
|
||||
|
||||
## Interfaces
|
||||
|
||||
| Name | Description | Physical | IP | Network |
|
||||
|------|-------------|----------|-----|---------|
|
||||
| wan | WAN | vtnet0 | 192.168.1.2/24 | Physical network |
|
||||
| lan | Management VMs | vtnet1 | 10.0.10.1/24 | VLAN 10 |
|
||||
| opt1 | Kubernetes | vtnet2 | 10.0.20.1/24 | VLAN 20 |
|
||||
| opt2 | WireGuard | tun_wg0 | 10.3.2.1/24 | VPN tunnel |
|
||||
| tailscale0 | Tailscale | tailscale0 | 100.64.0.x | Headscale mesh |
|
||||
|
||||
## CLI Script
|
||||
|
||||
**Script**: `.claude/pfsense.py`
|
||||
|
||||
### Execution Pattern
|
||||
```bash
|
||||
cd ~/code/infra && python3 .claude/pfsense.py <command> [options]
|
||||
```
|
||||
|
||||
### Available Commands
|
||||
|
||||
#### System Information
|
||||
```bash
|
||||
python3 .claude/pfsense.py status # Full system overview
|
||||
python3 .claude/pfsense.py uptime # Uptime
|
||||
python3 .claude/pfsense.py cpu # CPU info and load
|
||||
python3 .claude/pfsense.py memory # Memory breakdown
|
||||
python3 .claude/pfsense.py disk # Disk usage
|
||||
python3 .claude/pfsense.py temp # CPU temperature
|
||||
python3 .claude/pfsense.py pkg-list # Installed packages
|
||||
```
|
||||
|
||||
#### Network & Interfaces
|
||||
```bash
|
||||
python3 .claude/pfsense.py interfaces # Interface list with IPs
|
||||
python3 .claude/pfsense.py gateways # Gateway status
|
||||
python3 .claude/pfsense.py arp # ARP table
|
||||
python3 .claude/pfsense.py routes # Routing table
|
||||
python3 .claude/pfsense.py dns-resolve <host> # DNS lookup via pfSense
|
||||
python3 .claude/pfsense.py diag <host> # Ping test
|
||||
```
|
||||
|
||||
#### Firewall
|
||||
```bash
|
||||
python3 .claude/pfsense.py rules # All firewall rules
|
||||
python3 .claude/pfsense.py rules opt1 # Rules for Kubernetes interface
|
||||
python3 .claude/pfsense.py nat # NAT / port forwarding rules
|
||||
python3 .claude/pfsense.py aliases # List all aliases
|
||||
python3 .claude/pfsense.py alias <name> # Show alias members
|
||||
python3 .claude/pfsense.py states # State table summary
|
||||
python3 .claude/pfsense.py states-top 20 # Top 20 IPs by connection count
|
||||
```
|
||||
|
||||
#### DHCP
|
||||
```bash
|
||||
python3 .claude/pfsense.py dhcp-leases # All DHCP leases
|
||||
python3 .claude/pfsense.py dhcp-leases opt1 # Kubernetes network leases only
|
||||
```
|
||||
|
||||
#### Services
|
||||
```bash
|
||||
python3 .claude/pfsense.py services # List all services + status
|
||||
python3 .claude/pfsense.py service restart snort # Restart a service
|
||||
python3 .claude/pfsense.py service stop wireguard # Stop a service
|
||||
python3 .claude/pfsense.py service start wireguard # Start a service
|
||||
```
|
||||
|
||||
#### VPN & Routing
|
||||
```bash
|
||||
python3 .claude/pfsense.py wireguard # WireGuard tunnel status
|
||||
python3 .claude/pfsense.py tailscale # Tailscale/Headscale status
|
||||
python3 .claude/pfsense.py bgp # BGP summary (FRR)
|
||||
python3 .claude/pfsense.py ospf # OSPF neighbors (FRR)
|
||||
```
|
||||
|
||||
#### Security
|
||||
```bash
|
||||
python3 .claude/pfsense.py snort # Snort IDS status + recent alerts
|
||||
python3 .claude/pfsense.py logs # Last 50 firewall log entries
|
||||
python3 .claude/pfsense.py logs 200 # Last 200 entries
|
||||
python3 .claude/pfsense.py logs-filter "blocked" # Search logs
|
||||
```
|
||||
|
||||
#### Advanced
|
||||
```bash
|
||||
python3 .claude/pfsense.py pfctl "-sr" # Raw pfctl command
|
||||
python3 .claude/pfsense.py php "echo phpversion();" # Run PHP on pfSense
|
||||
python3 .claude/pfsense.py raw "ls /tmp" # Run arbitrary shell command
|
||||
python3 .claude/pfsense.py backup # Dump config.xml to stdout
|
||||
```
|
||||
|
||||
## Direct SSH Access
|
||||
|
||||
For tasks not covered by the script, SSH directly:
|
||||
```bash
|
||||
ssh admin@10.0.20.1 "<command>"
|
||||
```
|
||||
|
||||
### Useful Direct Commands
|
||||
```bash
|
||||
# pfSense PHP shell (interactive config access)
|
||||
ssh admin@10.0.20.1 "php -r 'require_once(\"config.inc\"); \$cfg = parse_config(true); echo json_encode(\$cfg[\"nat\"], JSON_PRETTY_PRINT);'"
|
||||
|
||||
# pfSsh.php playback commands
|
||||
ssh admin@10.0.20.1 "pfSsh.php playback gatewaystatus"
|
||||
ssh admin@10.0.20.1 "pfSsh.php playback svc restart snort"
|
||||
ssh admin@10.0.20.1 "pfSsh.php playback listpkg"
|
||||
|
||||
# Config sections via PHP
|
||||
ssh admin@10.0.20.1 "php -r 'require_once(\"config.inc\"); \$cfg = parse_config(true); print_r(\$cfg[\"filter\"][\"rule\"][0]);'"
|
||||
|
||||
# FRR/vtysh for routing
|
||||
ssh admin@10.0.20.1 "/usr/local/bin/vtysh -c 'show ip route'"
|
||||
ssh admin@10.0.20.1 "/usr/local/bin/vtysh -c 'show bgp ipv4 unicast'"
|
||||
```
|
||||
|
||||
## REST API (pfSense-pkg-RESTAPI v2.2)
|
||||
|
||||
The REST API package is installed but **no API keys are configured**. To use it:
|
||||
1. Create an API key in pfSense Web UI: System > REST API > Settings > Keys
|
||||
2. Use Bearer token auth: `curl -sk https://10.0.20.1/api/v2/status/system -H 'Authorization: Bearer <key>'`
|
||||
|
||||
Until API keys are set up, use SSH for all operations.
|
||||
|
||||
## Key Services
|
||||
|
||||
| Service | Status | Notes |
|
||||
|---------|--------|-------|
|
||||
| FRR (BGP/OSPF) | Running | Routing daemon |
|
||||
| Snort | Running | IDS/IPS |
|
||||
| WireGuard | Running | VPN tunnel (10.3.2.0/24) |
|
||||
| Tailscale | Running | Mesh VPN via Headscale |
|
||||
| FreeRADIUS | Running | RADIUS auth |
|
||||
| DHCP (Kea) | Running | kea-dhcp4 |
|
||||
| SSH | Running | Admin access |
|
||||
| NTP | Running | Time sync |
|
||||
|
||||
## Firewall Stats
|
||||
- **167 firewall rules** (pfctl -sr)
|
||||
- **154 NAT rules** (pfctl -sn)
|
||||
- **~784 active states** (varies)
|
||||
- **10 aliases** (LAN, OPT1, OPT2, WAN networks + custom)
|
||||
|
||||
## NFS Backup
|
||||
Config backups stored at NFS: `/mnt/main/pfsense-backup`
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
| Issue | Command |
|
||||
|-------|---------|
|
||||
| Can't reach internet from K8s | `python3 .claude/pfsense.py gateways` + `python3 .claude/pfsense.py diag 8.8.8.8` |
|
||||
| K8s pod can't reach external | `python3 .claude/pfsense.py rules opt1` + check NAT |
|
||||
| DHCP not working | `python3 .claude/pfsense.py dhcp-leases opt1` + `python3 .claude/pfsense.py service restart kea-dhcp4` |
|
||||
| High connection count | `python3 .claude/pfsense.py states-top 20` |
|
||||
| Snort blocking traffic | `python3 .claude/pfsense.py snort` + check alerts |
|
||||
| DNS resolution failing | `python3 .claude/pfsense.py dns-resolve <host>` |
|
||||
| BGP/OSPF routes missing | `python3 .claude/pfsense.py bgp` or `python3 .claude/pfsense.py ospf` |
|
||||
| WireGuard tunnel down | `python3 .claude/pfsense.py wireguard` |
|
||||
|
||||
## Notes
|
||||
1. **FreeBSD-based**: Commands differ from Linux (no `ip`, use `ifconfig`, `netstat`, `arp`)
|
||||
2. **pfctl is the firewall**: Rules loaded from config.xml via PHP, managed by pfctl
|
||||
3. **Config file**: `/cf/conf/config.xml` — all pfSense config in one XML file
|
||||
4. **PHP shell**: pfSense uses PHP for all config management; `config.inc` loads the config
|
||||
5. **Do NOT edit config.xml directly** — use the Web UI or PHP functions that properly reload services
|
||||
6. **Logs**: Binary circular logs, read with `clog -f /var/log/<logfile>`
|
||||
|
|
@ -1,78 +0,0 @@
|
|||
# Post-Mortem Writer
|
||||
|
||||
Generate a structured post-mortem document after an incident mitigation session.
|
||||
|
||||
## When to use
|
||||
- After `/post-mortem` command
|
||||
- Auto-suggested when cluster health transitions from UNHEALTHY → HEALTHY
|
||||
|
||||
## Instructions
|
||||
|
||||
1. **Gather context**:
|
||||
- Run `.claude/scripts/sev-context.sh` to capture current cluster state
|
||||
- Review the conversation history for: what broke, timeline, root cause, what was fixed
|
||||
- Check existing post-mortems at `docs/post-mortems/` for format reference
|
||||
|
||||
2. **Generate the post-mortem**:
|
||||
- Use the template at `.claude/skills/post-mortem/template.md`
|
||||
- Fill in all sections from the investigation context
|
||||
- **Critical**: In the Prevention Plan tables, set the `Type` column correctly:
|
||||
- `Alert` — add/modify Prometheus alerting rules (auto-implementable)
|
||||
- `Config` — change Terraform config, NFS options, etc. (auto-implementable)
|
||||
- `Monitor` — add Uptime Kuma monitors (auto-implementable)
|
||||
- `Architecture` — storage migration, stack redesign (human-only)
|
||||
- `Investigation` — needs further research (human-only)
|
||||
- `Runbook` — document a procedure (human-only)
|
||||
- `Migration` — data or service migration (human-only)
|
||||
- Items already fixed during the session should have Status = `Done`
|
||||
- Items not yet done should have Status = `TODO`
|
||||
|
||||
3. **File naming**: `docs/post-mortems/<YYYY-MM-DD>-<slug>.md`
|
||||
- Slug: lowercase, hyphenated, max 5 words describing the incident
|
||||
|
||||
4. **Update index**: Add an entry to `docs/post-mortems/index.html`
|
||||
- Add a new card in the incidents grid with date, severity tag, title, description
|
||||
|
||||
5. **Link to GitHub Issue** (if an issue exists for this incident):
|
||||
- Fill in the `Issue` field in the template metadata table with `[#N](https://github.com/ViktorBarzin/infra/issues/N)`
|
||||
- Add a comment to the GitHub Issue linking the postmortem:
|
||||
```bash
|
||||
GITHUB_TOKEN=$(vault kv get -field=github_pat secret/viktor)
|
||||
curl -s -X POST \
|
||||
-H "Authorization: token $GITHUB_TOKEN" \
|
||||
-H "Accept: application/vnd.github.v3+json" \
|
||||
"https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/comments" \
|
||||
-d '{"body": "**Postmortem:** [View postmortem](https://viktorbarzin.github.io/infra/post-mortems/<YYYY-MM-DD>-<slug>)"}'
|
||||
```
|
||||
- Add the `postmortem-done` label and remove `postmortem-required`:
|
||||
```bash
|
||||
curl -s -X POST \
|
||||
-H "Authorization: token $GITHUB_TOKEN" \
|
||||
"https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/labels" \
|
||||
-d '{"labels": ["postmortem-done"]}'
|
||||
curl -s -X DELETE \
|
||||
-H "Authorization: token $GITHUB_TOKEN" \
|
||||
"https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/labels/postmortem-required"
|
||||
```
|
||||
- If no issue exists, create one with labels `incident`, `sev<N>`, `postmortem-done`
|
||||
|
||||
6. **Commit and push**:
|
||||
```
|
||||
git add docs/post-mortems/<file>.md docs/post-mortems/index.html
|
||||
git commit -m "docs: post-mortem for <date> <title> [ci skip]"
|
||||
git push origin master
|
||||
```
|
||||
- Use `[ci skip]` to avoid triggering app-stacks pipeline
|
||||
- NOTE: The postmortem-todos Woodpecker pipeline WILL trigger (it has its own path filter)
|
||||
|
||||
## Type Reference for Prevention Plan
|
||||
|
||||
| Type | Auto-implementable? | Examples |
|
||||
|------|---------------------|----------|
|
||||
| Alert | Yes | Add PrometheusRule, modify alert thresholds |
|
||||
| Config | Yes | Change Terraform variables, mount options, CronJob schedules |
|
||||
| Monitor | Yes | Add Uptime Kuma HTTP/TCP monitor |
|
||||
| Architecture | No | Migrate storage class, redesign HA topology |
|
||||
| Investigation | No | Research kernel bug, check Proxmox forum |
|
||||
| Runbook | No | Document recovery procedure |
|
||||
| Migration | No | Move data between storage backends |
|
||||
|
|
@ -1,86 +0,0 @@
|
|||
# Post-Mortem: <TITLE>
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Date** | <DATE> |
|
||||
| **Duration** | <DURATION> |
|
||||
| **Severity** | <SEV1/SEV2/SEV3> |
|
||||
| **Affected Services** | <COUNT> pods across <COUNT> namespaces |
|
||||
| **Issue** | [#N](https://github.com/ViktorBarzin/infra/issues/N) |
|
||||
| **Status** | Draft |
|
||||
|
||||
## Summary
|
||||
|
||||
<1-2 sentence summary of the incident.>
|
||||
|
||||
## Impact
|
||||
|
||||
- **User-facing**: <What users experienced>
|
||||
- **Blast radius**: <How many services/pods/namespaces affected>
|
||||
- **Duration**: <How long the outage lasted>
|
||||
- **Data loss**: <None/details>
|
||||
- **Monitoring gap**: <Any blind spots in alerting>
|
||||
|
||||
## Timeline (UTC)
|
||||
|
||||
| Time | Event |
|
||||
|------|-------|
|
||||
| **HH:MM** | <First sign of trouble> |
|
||||
| **HH:MM** | <Detection / user report> |
|
||||
| **HH:MM** | <Investigation begins> |
|
||||
| **HH:MM** | <Root cause identified> |
|
||||
| **HH:MM** | <Fix applied> |
|
||||
| **HH:MM** | <Service restored> |
|
||||
|
||||
## Root Cause
|
||||
|
||||
<Narrative description of what went wrong and why.>
|
||||
|
||||
## Contributing Factors
|
||||
|
||||
1. <Factor that made the incident worse or harder to detect>
|
||||
2. <Factor...>
|
||||
|
||||
## Detection Gaps
|
||||
|
||||
| Gap | Impact | Fix |
|
||||
|-----|--------|-----|
|
||||
| <What wasn't monitored> | <How it delayed detection> | <What to add> |
|
||||
|
||||
## Prevention Plan
|
||||
|
||||
### P0 — Prevent this exact failure
|
||||
|
||||
| Priority | Action | Type | Details | Status |
|
||||
|----------|--------|------|---------|--------|
|
||||
| P0 | <action> | Config | <details> | TODO |
|
||||
|
||||
### P1 — Reduce blast radius
|
||||
|
||||
| Priority | Action | Type | Details | Status |
|
||||
|----------|--------|------|---------|--------|
|
||||
| P1 | <action> | Alert | <details> | TODO |
|
||||
|
||||
### P2 — Detect faster
|
||||
|
||||
| Priority | Action | Type | Details | Status |
|
||||
|----------|--------|------|---------|--------|
|
||||
| P2 | <action> | Monitor | <details> | TODO |
|
||||
|
||||
### P3 — Improve resilience
|
||||
|
||||
| Priority | Action | Type | Details | Status |
|
||||
|----------|--------|------|---------|--------|
|
||||
| P3 | <action> | Architecture | <details> | TODO |
|
||||
|
||||
## Lessons Learned
|
||||
|
||||
1. <Key takeaway>
|
||||
2. <Key takeaway>
|
||||
|
||||
## Follow-up Implementation
|
||||
|
||||
_This section is auto-populated by the postmortem-todo-resolver agent._
|
||||
|
||||
| Date | Action | Priority | Type | Commit | Implemented By |
|
||||
|------|--------|----------|------|--------|----------------|
|
||||
|
|
@ -1,522 +0,0 @@
|
|||
---
|
||||
name: setup-project
|
||||
description: |
|
||||
Deploy a new self-hosted service to the Kubernetes cluster from a GitHub repository.
|
||||
Use when: (1) User provides a GitHub URL or project name and wants to deploy it,
|
||||
(2) User says "deploy [service]" or "set up [service]",
|
||||
(3) User wants to add a new service to the cluster.
|
||||
Automated workflow: Docker image → Terraform module → Deploy.
|
||||
Handles database setup, ingress, DNS configuration.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2025-01-01
|
||||
---
|
||||
|
||||
# Setup Project Skill
|
||||
|
||||
**Purpose**: Deploy a new self-hosted service to the Kubernetes cluster from a GitHub repository.
|
||||
|
||||
**When to use**: User provides a GitHub URL or project name and wants to deploy it to the cluster.
|
||||
|
||||
## Workflow
|
||||
|
||||
### 1. Research Phase
|
||||
|
||||
**Input**: GitHub repository URL or project name
|
||||
|
||||
**Actions**:
|
||||
- Visit the GitHub repository
|
||||
- Check the README for:
|
||||
- Official Docker image (Docker Hub, ghcr.io, etc.)
|
||||
- docker-compose.yml file
|
||||
- Self-hosting documentation
|
||||
- Required dependencies (PostgreSQL, MySQL, Redis, etc.)
|
||||
- Environment variables needed
|
||||
- Default ports
|
||||
- Storage requirements
|
||||
|
||||
**Find Docker Image Priority**:
|
||||
1. Check official documentation for recommended image
|
||||
2. Look in docker-compose.yml for `image:` directive
|
||||
3. Check GitHub Container Registry: `ghcr.io/<org>/<repo>`
|
||||
4. Check Docker Hub: `<org>/<repo>`
|
||||
5. Check releases page for container images
|
||||
6. Last resort: Build from Dockerfile (avoid if possible)
|
||||
|
||||
**Classify Dockerfile State** (drives whether we contribute a PR back upstream later):
|
||||
|
||||
| State | When | Action on deploy success |
|
||||
|---|---|---|
|
||||
| `image-used` | An official/community image worked (priority 1-5). | No upstream PR. Default case. |
|
||||
| `used-as-is` | Upstream ships a Dockerfile; it built and ran fine. | No upstream PR. |
|
||||
| `fixed-broken-upstream` | Upstream Dockerfile exists but fails to build / run; we patched it. | Open a `fix-dockerfile` PR after stability gate. |
|
||||
| `written-from-scratch` | Upstream has no Dockerfile at all; we authored one. | Open an `add-dockerfile` PR after stability gate. |
|
||||
|
||||
Record the chosen state and supporting metadata in `modules/kubernetes/<service>/.contribution-state.json`. When we author or fix a Dockerfile, also write `modules/kubernetes/<service>/files/Dockerfile`, `.dockerignore`, and `BUILD.md` (from `templates/Dockerfile.README.md`) — these travel with the upstream PR.
|
||||
|
||||
```json
|
||||
{
|
||||
"upstream_repo": "owner/name",
|
||||
"dockerfile_state": "written-from-scratch",
|
||||
"dockerfile_path_in_infra": "modules/kubernetes/<service>/files/Dockerfile",
|
||||
"deploy_target_url": "https://<service>.viktorbarzin.me",
|
||||
"image_tag": "registry.viktorbarzin.me/<service>:<sha>",
|
||||
"image_size": "<MB>",
|
||||
"base_image": "<e.g. python:3.12-slim>",
|
||||
"dockerfile_shape": "multi-stage, non-root, linux/amd64",
|
||||
"deploy_verified_at": null,
|
||||
"contribution_pr_url": null
|
||||
}
|
||||
```
|
||||
|
||||
**Dockerfile quality bar** (when writing one ourselves — enforced before PR):
|
||||
- Multi-stage build where it makes sense (Node, Go, Rust, Python with compiled deps).
|
||||
- Explicit non-root `USER`.
|
||||
- `HEALTHCHECK` when the app exposes a known endpoint.
|
||||
- Minimal base image (alpine / distroless preferred; `-slim` otherwise).
|
||||
- No secrets baked in; runtime config via `ENV`.
|
||||
- `.dockerignore` that excludes `.git`, `node_modules`, test artifacts.
|
||||
|
||||
**Extract Configuration**:
|
||||
- Container port (default port the app listens on)
|
||||
- Environment variables (DATABASE_URL, REDIS_HOST, SMTP, etc.)
|
||||
- Volume mounts (what data needs persistence)
|
||||
- Dependencies (database type, cache, etc.)
|
||||
|
||||
### 2. Database Setup (if needed)
|
||||
|
||||
**If project requires PostgreSQL**:
|
||||
- User provides database credentials or use pattern: `<service>` user with secure password
|
||||
- Database will be created in shared `postgresql.dbaas.svc.cluster.local`
|
||||
- Connection string format: `postgresql://<user>:<password>@postgresql.dbaas.svc.cluster.local:5432/<dbname>`
|
||||
|
||||
**If project requires MySQL**:
|
||||
- User provides database credentials
|
||||
- Database in shared `mysql.dbaas.svc.cluster.local`
|
||||
- Connection string format: `mysql://<user>:<password>@mysql.dbaas.svc.cluster.local:3306/<dbname>`
|
||||
|
||||
**If project requires Redis**:
|
||||
- Use shared Redis: `redis.redis.svc.cluster.local:6379`
|
||||
- No password required
|
||||
|
||||
**IMPORTANT**: Never create databases yourself - always ask user for credentials to use.
|
||||
|
||||
### 3. NFS Storage Setup (if service needs persistent data)
|
||||
|
||||
**IMPORTANT**: NFS directories must exist and be exported on the NFS server BEFORE deploying the service. If the directory doesn't exist, the pod will fail to mount the volume and get stuck in `ContainerCreating`.
|
||||
|
||||
**Steps**:
|
||||
|
||||
1. **Create the directory on the NFS server**:
|
||||
```bash
|
||||
ssh root@10.0.10.15 'mkdir -p /mnt/main/<service> && chmod 777 /mnt/main/<service>'
|
||||
```
|
||||
|
||||
2. **Export the directory via TrueNAS**:
|
||||
- The NFS export must be configured in TrueNAS so Kubernetes nodes can mount it
|
||||
- Create the export via TrueNAS WebUI or API, allowing access from the Kubernetes network (10.0.20.0/24)
|
||||
- Verify the export is accessible:
|
||||
```bash
|
||||
# From a k8s node or the dev VM
|
||||
showmount -e 10.0.10.15 | grep <service>
|
||||
```
|
||||
|
||||
3. **Verify the mount works before proceeding**:
|
||||
```bash
|
||||
# Quick test from a k8s node
|
||||
ssh root@10.0.20.100 'mount -t nfs 10.0.10.15:/mnt/main/<service> /tmp/test-mount && ls /tmp/test-mount && umount /tmp/test-mount'
|
||||
```
|
||||
|
||||
**Only proceed to Terraform module creation after confirming the NFS export is accessible.**
|
||||
|
||||
### 4. Terraform Module Creation
|
||||
|
||||
**Create module directory**:
|
||||
```bash
|
||||
mkdir -p modules/kubernetes/<service-name>/
|
||||
```
|
||||
|
||||
**Create `modules/kubernetes/<service-name>/main.tf`**:
|
||||
|
||||
```hcl
|
||||
variable "tls_secret_name" {}
|
||||
variable "tier" { type = string }
|
||||
variable "postgresql_password" {} # Only if needed
|
||||
# Add other variables as needed (smtp_password, api_keys, etc.)
|
||||
|
||||
resource "kubernetes_namespace" "<service>" {
|
||||
metadata {
|
||||
name = "<service>"
|
||||
}
|
||||
}
|
||||
|
||||
module "tls_secret" {
|
||||
source = "../setup_tls_secret"
|
||||
namespace = kubernetes_namespace.<service>.metadata[0].name
|
||||
tls_secret_name = var.tls_secret_name
|
||||
}
|
||||
|
||||
# If database migrations needed, add init_container
|
||||
resource "kubernetes_deployment" "<service>" {
|
||||
metadata {
|
||||
name = "<service>"
|
||||
namespace = kubernetes_namespace.<service>.metadata[0].name
|
||||
labels = {
|
||||
app = "<service>"
|
||||
tier = var.tier
|
||||
}
|
||||
}
|
||||
spec {
|
||||
replicas = 1
|
||||
selector {
|
||||
match_labels = {
|
||||
app = "<service>"
|
||||
}
|
||||
}
|
||||
template {
|
||||
metadata {
|
||||
labels = {
|
||||
app = "<service>"
|
||||
}
|
||||
}
|
||||
spec {
|
||||
# Init container for migrations (if needed)
|
||||
# init_container { ... }
|
||||
|
||||
container {
|
||||
name = "<service>"
|
||||
image = "<docker-image>:<tag>"
|
||||
|
||||
port {
|
||||
container_port = <port>
|
||||
}
|
||||
|
||||
# Environment variables
|
||||
env {
|
||||
name = "DATABASE_URL"
|
||||
value = "postgresql://<service>:${var.postgresql_password}@postgresql.dbaas.svc.cluster.local:5432/<service>"
|
||||
}
|
||||
# Add other env vars as needed
|
||||
|
||||
# Volume mounts for persistent data
|
||||
volume_mount {
|
||||
name = "data"
|
||||
mount_path = "<mount-path>"
|
||||
sub_path = "<optional-subpath>"
|
||||
}
|
||||
|
||||
resources {
|
||||
requests = {
|
||||
memory = "256Mi"
|
||||
cpu = "100m"
|
||||
}
|
||||
limits = {
|
||||
memory = "2Gi"
|
||||
cpu = "1"
|
||||
}
|
||||
}
|
||||
|
||||
# Health checks (if endpoints exist)
|
||||
liveness_probe {
|
||||
http_get {
|
||||
path = "/health" # or /healthz, /, etc.
|
||||
port = <port>
|
||||
}
|
||||
initial_delay_seconds = 60
|
||||
period_seconds = 30
|
||||
}
|
||||
}
|
||||
|
||||
# NFS volume for persistence
|
||||
volume {
|
||||
name = "data"
|
||||
nfs {
|
||||
server = "10.0.10.15"
|
||||
path = "/mnt/main/<service>"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
resource "kubernetes_service" "<service>" {
|
||||
metadata {
|
||||
name = "<service>"
|
||||
namespace = kubernetes_namespace.<service>.metadata[0].name
|
||||
labels = {
|
||||
app = "<service>"
|
||||
}
|
||||
}
|
||||
|
||||
spec {
|
||||
selector = {
|
||||
app = "<service>"
|
||||
}
|
||||
port {
|
||||
name = "http"
|
||||
port = 80
|
||||
target_port = <container-port>
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
module "ingress" {
|
||||
source = "../ingress_factory"
|
||||
namespace = kubernetes_namespace.<service>.metadata[0].name
|
||||
name = "<service>"
|
||||
tls_secret_name = var.tls_secret_name
|
||||
# Add extra_annotations if needed (proxy-body-size, timeouts, etc.)
|
||||
}
|
||||
```
|
||||
|
||||
### 5. Update Main Terraform Files
|
||||
|
||||
**Add to `modules/kubernetes/main.tf`**:
|
||||
|
||||
1. Add variable declarations at top:
|
||||
```hcl
|
||||
variable "<service>_postgresql_password" { type = string }
|
||||
```
|
||||
|
||||
2. Add to appropriate DEFCON level (ask user which level, default to 5):
|
||||
```hcl
|
||||
5 : [
|
||||
...,
|
||||
"<service>"
|
||||
]
|
||||
```
|
||||
|
||||
3. Add module block at bottom:
|
||||
```hcl
|
||||
module "<service>" {
|
||||
source = "./<service>"
|
||||
for_each = contains(local.active_modules, "<service>") ? { <service> = true } : {}
|
||||
tls_secret_name = var.tls_secret_name
|
||||
postgresql_password = var.<service>_postgresql_password
|
||||
tier = local.tiers.aux # or appropriate tier
|
||||
|
||||
depends_on = [null_resource.core_services]
|
||||
}
|
||||
```
|
||||
|
||||
**Add to `main.tf`**:
|
||||
|
||||
1. Add variable:
|
||||
```hcl
|
||||
variable "<service>_postgresql_password" { type = string }
|
||||
```
|
||||
|
||||
2. Pass to kubernetes_cluster module:
|
||||
```hcl
|
||||
module "kubernetes_cluster" {
|
||||
...
|
||||
<service>_postgresql_password = var.<service>_postgresql_password
|
||||
}
|
||||
```
|
||||
|
||||
**Update `terraform.tfvars`**:
|
||||
|
||||
1. Add password/credentials:
|
||||
```hcl
|
||||
<service>_postgresql_password = "<secure-password>"
|
||||
```
|
||||
|
||||
2. Add to Cloudflare DNS (ask user if proxied or non-proxied):
|
||||
```hcl
|
||||
cloudflare_non_proxied_names = [
|
||||
...,
|
||||
"<service>"
|
||||
]
|
||||
```
|
||||
|
||||
### 6. Email/SMTP Configuration (if needed)
|
||||
|
||||
If service needs to send emails:
|
||||
```hcl
|
||||
env {
|
||||
name = "MAILER_HOST"
|
||||
value = "mailserver.viktorbarzin.me" # Public hostname for TLS
|
||||
}
|
||||
env {
|
||||
name = "MAILER_PORT"
|
||||
value = "587"
|
||||
}
|
||||
env {
|
||||
name = "MAILER_USER"
|
||||
value = "info@viktorbarzin.me"
|
||||
}
|
||||
env {
|
||||
name = "MAILER_PASSWORD"
|
||||
value = var.mailserver_accounts["info@viktorbarzin.me"] # Pass from module
|
||||
}
|
||||
```
|
||||
|
||||
Add to module call:
|
||||
```hcl
|
||||
smtp_password = var.mailserver_accounts["info@viktorbarzin.me"]
|
||||
```
|
||||
|
||||
### 7. Apply Terraform
|
||||
|
||||
```bash
|
||||
terraform init
|
||||
terraform apply -target=module.kubernetes_cluster.module.<service> -var="kube_config_path=$(pwd)/config" -auto-approve
|
||||
```
|
||||
|
||||
**IMPORTANT: Also apply the cloudflared module to create the Cloudflare DNS record:**
|
||||
```bash
|
||||
terraform apply -target=module.kubernetes_cluster.module.cloudflared -var="kube_config_path=$(pwd)/config" -auto-approve
|
||||
```
|
||||
Without this step, the DNS record won't be created even though it's defined in `terraform.tfvars`.
|
||||
|
||||
### 8. Verification
|
||||
|
||||
```bash
|
||||
kubectl get pods -n <service>
|
||||
kubectl logs -n <service> -l app=<service> --tail=50
|
||||
```
|
||||
|
||||
Test URL: `https://<service>.viktorbarzin.me`
|
||||
|
||||
### 8b. Stability Gate (required when `dockerfile_state ∈ {written-from-scratch, fixed-broken-upstream}`)
|
||||
|
||||
Before committing — and before any upstream PR in §10 — run a 10-minute stability check to catch pods that crash-loop a few minutes after Ready.
|
||||
|
||||
```bash
|
||||
.claude/skills/setup-project/scripts/stability-gate.sh <service> <service> https://<service>.viktorbarzin.me
|
||||
```
|
||||
|
||||
Polls pod readiness + `curl` 200 every 30s × 20 iterations. Requires 18/20 successes (tolerates 2 blips).
|
||||
|
||||
- **Pass** → update the state file: `jq '.deploy_verified_at = (now | todate)' .contribution-state.json | sponge .contribution-state.json` → proceed to §9 and §10.
|
||||
- **Fail** → stop. Investigate via `kubectl logs`, `kubectl describe`. Do NOT commit. Do NOT fire §10. Re-run the gate after fixes.
|
||||
|
||||
For `image-used` / `used-as-is` states, the gate is optional (app is already running a known-good image).
|
||||
|
||||
### 9. Commit Changes
|
||||
|
||||
```bash
|
||||
git add modules/kubernetes/<service>/ main.tf modules/kubernetes/main.tf terraform.tfvars
|
||||
git commit -m "Add <service> deployment
|
||||
|
||||
- Deploy <service> as <description>
|
||||
- Uses <dependencies>
|
||||
- Ingress at <service>.viktorbarzin.me
|
||||
|
||||
[ci skip]"
|
||||
```
|
||||
|
||||
### 10. Contribute Dockerfile Upstream (only when `dockerfile_state ∈ {written-from-scratch, fixed-broken-upstream}`)
|
||||
|
||||
Goal: give the community the working Dockerfile we just validated in production.
|
||||
|
||||
**Preconditions** (script enforces):
|
||||
- `.contribution-state.json` present with a trigger state and `deploy_verified_at` set.
|
||||
- `files/Dockerfile`, `files/.dockerignore`, `files/BUILD.md` exist next to the module.
|
||||
- `GITHUB_TOKEN` in env — or `vault kv get -field=github_pat secret/viktor` is reachable.
|
||||
|
||||
**Run**:
|
||||
```bash
|
||||
.claude/skills/setup-project/scripts/contribute-dockerfile.sh modules/kubernetes/<service>
|
||||
```
|
||||
|
||||
**What the script does** (all via GitHub REST — `gh` CLI is sandbox-blocked):
|
||||
1. Reads `.contribution-state.json`; skips unless state is `written-from-scratch` or `fixed-broken-upstream` and no `contribution_pr_url` is already recorded.
|
||||
2. Upstream sanity checks: repo exists, public, not archived; default branch discoverable; for `written-from-scratch`, verifies a `Dockerfile` didn't land upstream while we were deploying; bails cleanly if an open PR from our fork already exists.
|
||||
3. `POST /repos/<owner>/<name>/forks` — idempotent; waits up to 30s for the fork to be ready at `ViktorBarzin/<name>`.
|
||||
4. `POST /repos/ViktorBarzin/<name>/merge-upstream` — keeps fork current with upstream default branch.
|
||||
5. Creates branch `add-dockerfile` (or `fix-dockerfile`), timestamp-suffixed if that branch already exists with unrelated commits.
|
||||
6. Commits `Dockerfile`, `.dockerignore`, `BUILD.md` via Contents API. Each commit message carries `Signed-off-by:` for DCO-enforcing repos.
|
||||
7. Opens PR against upstream with body rendered from `templates/PR_BODY.md`.
|
||||
8. Writes `contribution_pr_url` back into `.contribution-state.json` and echoes the URL.
|
||||
|
||||
**Failure handling**:
|
||||
- Upstream archived / private / deleted → logged as SKIP, deploy success stands.
|
||||
- Fork/branch/PR already exists → treated as idempotent success; existing URL recorded.
|
||||
- GitHub 5xx → 3× exponential backoff, then hard fail with a clear message — safe to re-run the script.
|
||||
|
||||
**After the PR opens**: the URL is in `.contribution-state.json`. Share it with the user. No automated follow-up on merge/reject — that's a manual check for now.
|
||||
|
||||
## Common Patterns
|
||||
|
||||
### Init Container for Migrations
|
||||
```hcl
|
||||
init_container {
|
||||
name = "migration"
|
||||
image = "<same-image>"
|
||||
command = ["sh", "-c", "<migration-command>"]
|
||||
|
||||
# Same env vars and volumes as main container
|
||||
}
|
||||
```
|
||||
|
||||
### Dynamic Environment Variables
|
||||
```hcl
|
||||
locals {
|
||||
common_env = [
|
||||
{ name = "VAR1", value = "value1" },
|
||||
{ name = "VAR2", value = "value2" },
|
||||
]
|
||||
}
|
||||
|
||||
dynamic "env" {
|
||||
for_each = local.common_env
|
||||
content {
|
||||
name = env.value.name
|
||||
value = env.value.value
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### External URL Configuration
|
||||
Many apps need their public URL configured:
|
||||
```hcl
|
||||
env {
|
||||
name = "APP_URL" # or PUBLIC_URL, EXTERNAL_URL, etc.
|
||||
value = "https://<service>.viktorbarzin.me"
|
||||
}
|
||||
env {
|
||||
name = "HTTPS" # or ENABLE_HTTPS, etc.
|
||||
value = "true"
|
||||
}
|
||||
```
|
||||
|
||||
## Checklist
|
||||
|
||||
- [ ] Find official Docker image or docker-compose
|
||||
- [ ] Identify dependencies (DB, Redis, etc.)
|
||||
- [ ] Ask user for database credentials (never create yourself)
|
||||
- [ ] Create NFS directory and export on TrueNAS (if persistent storage needed)
|
||||
- [ ] Verify NFS mount is accessible from k8s nodes
|
||||
- [ ] Create `modules/kubernetes/<service>/main.tf`
|
||||
- [ ] Classify `dockerfile_state` and write `.contribution-state.json`
|
||||
- [ ] If writing/fixing Dockerfile: satisfy the quality bar (multi-stage, non-root, `.dockerignore`, `BUILD.md`)
|
||||
- [ ] Update `modules/kubernetes/main.tf` (variables, DEFCON level, module block)
|
||||
- [ ] Update `main.tf` (variable, pass to module)
|
||||
- [ ] Update `terraform.tfvars` (password, Cloudflare DNS)
|
||||
- [ ] Run `terraform init` and `terraform apply`
|
||||
- [ ] Verify pods are running
|
||||
- [ ] Test the URL
|
||||
- [ ] Run stability-gate.sh — needed for contribution, optional otherwise
|
||||
- [ ] Commit changes with `[ci skip]`
|
||||
- [ ] Run contribute-dockerfile.sh if state triggers an upstream PR
|
||||
|
||||
## Questions to Ask User
|
||||
|
||||
1. What DEFCON level should this service be in? (Default: 5)
|
||||
2. Should Cloudflare proxy this domain? (Default: no, add to non_proxied_names)
|
||||
3. Does this need email/SMTP? (Configure if yes)
|
||||
4. What database credentials should I use? (Never create yourself)
|
||||
5. What tier? (core/cluster/gpu/edge/aux - default: aux)
|
||||
|
||||
## Notes
|
||||
|
||||
- **Always create NFS directories and exports BEFORE deploying** - pods will get stuck in `ContainerCreating` if the NFS path doesn't exist or isn't exported
|
||||
- **Always use official documentation** as the source of truth
|
||||
- **Prefer stable/latest tags** over specific versions for self-hosted
|
||||
- **Use shared infrastructure**: PostgreSQL at `postgresql.dbaas.svc.cluster.local`, Redis at `redis.redis.svc.cluster.local`
|
||||
- **NFS storage**: Always at `10.0.10.15:/mnt/main/<service>`
|
||||
- **Email**: Use `mailserver.viktorbarzin.me` (public hostname) not internal service name
|
||||
- **Resource limits**: Start conservative, can increase if needed
|
||||
- **Health checks**: Only add if the app has health endpoints
|
||||
|
|
@ -1,270 +0,0 @@
|
|||
#!/usr/bin/env bash
|
||||
# Contribute a working Dockerfile back to an upstream GitHub repo.
|
||||
#
|
||||
# Reads state from <service-module-dir>/.contribution-state.json and:
|
||||
# 1. Validates triggers (dockerfile_state ∈ {written-from-scratch, fixed-broken-upstream})
|
||||
# 2. Confirms upstream is public, not archived, no concurrent Dockerfile landed
|
||||
# 3. Forks upstream to ViktorBarzin (idempotent)
|
||||
# 4. Syncs fork with upstream default branch
|
||||
# 5. Creates branch (add-dockerfile or fix-dockerfile), appends -<ts> on collision
|
||||
# 6. Commits Dockerfile + .dockerignore + BUILD.md via Contents API
|
||||
# 7. Opens PR against upstream with body rendered from PR_BODY.md
|
||||
# 8. Writes contribution_pr_url back into state file
|
||||
#
|
||||
# Usage:
|
||||
# contribute-dockerfile.sh <service-module-dir>
|
||||
#
|
||||
# Example:
|
||||
# contribute-dockerfile.sh /home/wizard/code/infra/modules/kubernetes/myapp
|
||||
#
|
||||
# Requires: jq, curl, vault CLI (logged in).
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
TEMPLATES_DIR="$(cd "$SCRIPT_DIR/../templates" && pwd)"
|
||||
|
||||
FORK_OWNER="ViktorBarzin"
|
||||
|
||||
log() { echo "contribute-dockerfile: $*"; }
|
||||
die() { echo "contribute-dockerfile: ERROR: $*" >&2; exit 1; }
|
||||
skip() { echo "contribute-dockerfile: SKIP: $*"; exit 0; }
|
||||
|
||||
if [ "$#" -ne 1 ]; then
|
||||
die "usage: $0 <service-module-dir>"
|
||||
fi
|
||||
|
||||
MODULE_DIR="$1"
|
||||
STATE_FILE="$MODULE_DIR/.contribution-state.json"
|
||||
|
||||
[ -f "$STATE_FILE" ] || die "state file not found: $STATE_FILE"
|
||||
|
||||
# --- Read + validate state ---
|
||||
dockerfile_state=$(jq -r '.dockerfile_state // ""' "$STATE_FILE")
|
||||
upstream_repo=$(jq -r '.upstream_repo // ""' "$STATE_FILE")
|
||||
dockerfile_path=$(jq -r '.dockerfile_path_in_infra // ""' "$STATE_FILE")
|
||||
deploy_verified_at=$(jq -r '.deploy_verified_at // ""' "$STATE_FILE")
|
||||
existing_pr_url=$(jq -r '.contribution_pr_url // ""' "$STATE_FILE")
|
||||
|
||||
if [ -n "$existing_pr_url" ] && [ "$existing_pr_url" != "null" ]; then
|
||||
skip "PR already exists: $existing_pr_url"
|
||||
fi
|
||||
|
||||
case "$dockerfile_state" in
|
||||
written-from-scratch) BRANCH_NAME="add-dockerfile"; reason_type="none" ;;
|
||||
fixed-broken-upstream) BRANCH_NAME="fix-dockerfile"; reason_type="broken" ;;
|
||||
*) skip "dockerfile_state='$dockerfile_state' — nothing to contribute" ;;
|
||||
esac
|
||||
|
||||
[ -z "$deploy_verified_at" ] || [ "$deploy_verified_at" = "null" ] && die "deploy not verified yet (deploy_verified_at empty); run stability-gate first"
|
||||
|
||||
[ -z "$upstream_repo" ] && die "upstream_repo empty in state file"
|
||||
[[ "$upstream_repo" == */* ]] || die "upstream_repo must be owner/name, got: $upstream_repo"
|
||||
|
||||
UP_OWNER="${upstream_repo%/*}"
|
||||
UP_NAME="${upstream_repo#*/}"
|
||||
|
||||
abs_dockerfile="$MODULE_DIR/$(basename "$dockerfile_path")"
|
||||
if [ ! -f "$MODULE_DIR/files/Dockerfile" ]; then
|
||||
die "Dockerfile not found at $MODULE_DIR/files/Dockerfile"
|
||||
fi
|
||||
DOCKERFILE_SRC="$MODULE_DIR/files/Dockerfile"
|
||||
DOCKERIGNORE_SRC="$MODULE_DIR/files/.dockerignore"
|
||||
BUILDMD_SRC="$MODULE_DIR/files/BUILD.md"
|
||||
for f in "$DOCKERIGNORE_SRC" "$BUILDMD_SRC"; do
|
||||
[ -f "$f" ] || die "required file missing: $f"
|
||||
done
|
||||
|
||||
# --- GitHub auth ---
|
||||
GITHUB_TOKEN="${GITHUB_TOKEN:-$(vault kv get -field=github_pat secret/viktor 2>/dev/null || true)}"
|
||||
[ -n "$GITHUB_TOKEN" ] || die "GITHUB_TOKEN not set and vault lookup failed (vault login -method=oidc first)"
|
||||
|
||||
gh_api() {
|
||||
local method="$1"; local path="$2"; local data="${3:-}"
|
||||
local url="https://api.github.com${path}"
|
||||
local curl_args=(-sS -w "\n%{http_code}" -X "$method"
|
||||
-H "Authorization: token $GITHUB_TOKEN"
|
||||
-H "Accept: application/vnd.github+json"
|
||||
-H "X-GitHub-Api-Version: 2022-11-28")
|
||||
[ -n "$data" ] && curl_args+=(-d "$data")
|
||||
curl "${curl_args[@]}" "$url"
|
||||
}
|
||||
|
||||
gh_api_retry() {
|
||||
local method="$1"; local path="$2"; local data="${3:-}"
|
||||
local attempt=1
|
||||
local max_attempts=3
|
||||
local out http
|
||||
while [ "$attempt" -le "$max_attempts" ]; do
|
||||
out=$(gh_api "$method" "$path" "$data")
|
||||
http=$(printf '%s' "$out" | tail -n1)
|
||||
body=$(printf '%s' "$out" | sed '$d')
|
||||
if [ "$http" -ge 500 ] || [ "$http" = "000" ]; then
|
||||
log "retry $attempt/$max_attempts on $method $path (http=$http)"
|
||||
attempt=$((attempt + 1))
|
||||
sleep $((2 ** attempt))
|
||||
continue
|
||||
fi
|
||||
printf '%s\n%s' "$body" "$http"
|
||||
return 0
|
||||
done
|
||||
die "GitHub API 5xx after $max_attempts attempts on $method $path"
|
||||
}
|
||||
|
||||
# Helpers that parse the combined body+http form.
|
||||
gh_http() { printf '%s' "$1" | tail -n1; }
|
||||
gh_body() { printf '%s' "$1" | sed '$d'; }
|
||||
|
||||
# --- Upstream sanity checks ---
|
||||
log "checking upstream $upstream_repo"
|
||||
resp=$(gh_api_retry GET "/repos/$UP_OWNER/$UP_NAME")
|
||||
http=$(gh_http "$resp"); body=$(gh_body "$resp")
|
||||
if [ "$http" = "404" ]; then skip "upstream repo not found (may be private or deleted): $upstream_repo"; fi
|
||||
[ "$http" = "200" ] || die "GET upstream failed http=$http body=$body"
|
||||
|
||||
archived=$(printf '%s' "$body" | jq -r '.archived')
|
||||
default_branch=$(printf '%s' "$body" | jq -r '.default_branch')
|
||||
[ "$archived" = "true" ] && skip "upstream is archived — not opening PR"
|
||||
[ -n "$default_branch" ] || die "could not determine upstream default branch"
|
||||
log "upstream default branch: $default_branch"
|
||||
|
||||
# If we wrote the Dockerfile from scratch, make sure one didn't land upstream meanwhile.
|
||||
if [ "$dockerfile_state" = "written-from-scratch" ]; then
|
||||
resp=$(gh_api_retry GET "/repos/$UP_OWNER/$UP_NAME/contents/Dockerfile?ref=$default_branch")
|
||||
http=$(gh_http "$resp")
|
||||
if [ "$http" = "200" ]; then
|
||||
skip "a Dockerfile landed upstream since we started — aborting to avoid clobbering"
|
||||
fi
|
||||
fi
|
||||
|
||||
# Check for an existing open PR from our fork.
|
||||
resp=$(gh_api_retry GET "/repos/$UP_OWNER/$UP_NAME/pulls?state=open&head=${FORK_OWNER}:${BRANCH_NAME}")
|
||||
http=$(gh_http "$resp"); body=$(gh_body "$resp")
|
||||
if [ "$http" = "200" ]; then
|
||||
existing=$(printf '%s' "$body" | jq -r '.[0].html_url // ""')
|
||||
if [ -n "$existing" ]; then
|
||||
log "existing open PR found: $existing — recording and skipping"
|
||||
jq --arg url "$existing" '.contribution_pr_url = $url' "$STATE_FILE" > "$STATE_FILE.tmp" && mv "$STATE_FILE.tmp" "$STATE_FILE"
|
||||
exit 0
|
||||
fi
|
||||
fi
|
||||
|
||||
# --- Fork ---
|
||||
log "ensuring fork exists at $FORK_OWNER/$UP_NAME"
|
||||
resp=$(gh_api_retry POST "/repos/$UP_OWNER/$UP_NAME/forks" '{}')
|
||||
http=$(gh_http "$resp")
|
||||
if [ "$http" != "202" ] && [ "$http" != "200" ]; then
|
||||
die "fork call failed http=$http"
|
||||
fi
|
||||
|
||||
# Wait for fork to be ready (GitHub can take up to ~30s).
|
||||
for i in $(seq 1 15); do
|
||||
resp=$(gh_api_retry GET "/repos/$FORK_OWNER/$UP_NAME")
|
||||
if [ "$(gh_http "$resp")" = "200" ]; then break; fi
|
||||
sleep 2
|
||||
done
|
||||
[ "$(gh_http "$resp")" = "200" ] || die "fork $FORK_OWNER/$UP_NAME did not become ready"
|
||||
|
||||
# --- Sync fork with upstream default branch ---
|
||||
log "syncing fork with upstream/$default_branch"
|
||||
resp=$(gh_api_retry POST "/repos/$FORK_OWNER/$UP_NAME/merge-upstream" "$(jq -n --arg b "$default_branch" '{branch:$b}')")
|
||||
http=$(gh_http "$resp")
|
||||
[ "$http" = "200" ] || [ "$http" = "409" ] || log "merge-upstream returned http=$http (continuing)"
|
||||
|
||||
# --- Determine base SHA for new branch ---
|
||||
resp=$(gh_api_retry GET "/repos/$FORK_OWNER/$UP_NAME/git/ref/heads/$default_branch")
|
||||
http=$(gh_http "$resp"); body=$(gh_body "$resp")
|
||||
[ "$http" = "200" ] || die "could not read default branch ref on fork (http=$http)"
|
||||
base_sha=$(printf '%s' "$body" | jq -r '.object.sha')
|
||||
|
||||
# --- Create branch (or append timestamp on collision) ---
|
||||
attempt_branch="$BRANCH_NAME"
|
||||
resp=$(gh_api_retry GET "/repos/$FORK_OWNER/$UP_NAME/git/ref/heads/$attempt_branch")
|
||||
if [ "$(gh_http "$resp")" = "200" ]; then
|
||||
attempt_branch="${BRANCH_NAME}-$(date +%s | tail -c 9)"
|
||||
log "branch existed; using $attempt_branch"
|
||||
fi
|
||||
|
||||
log "creating branch $attempt_branch off $base_sha"
|
||||
payload=$(jq -n --arg r "refs/heads/$attempt_branch" --arg s "$base_sha" '{ref:$r,sha:$s}')
|
||||
resp=$(gh_api_retry POST "/repos/$FORK_OWNER/$UP_NAME/git/refs" "$payload")
|
||||
[ "$(gh_http "$resp")" = "201" ] || die "could not create branch: $(gh_body "$resp")"
|
||||
|
||||
# --- Helper to PUT a file via Contents API ---
|
||||
put_file() {
|
||||
local src="$1"; local dst="$2"; local message="$3"
|
||||
local b64 payload exists_resp http existing_sha=""
|
||||
b64=$(base64 -w0 < "$src")
|
||||
|
||||
exists_resp=$(gh_api_retry GET "/repos/$FORK_OWNER/$UP_NAME/contents/$dst?ref=$attempt_branch")
|
||||
if [ "$(gh_http "$exists_resp")" = "200" ]; then
|
||||
existing_sha=$(gh_body "$exists_resp" | jq -r '.sha')
|
||||
fi
|
||||
|
||||
if [ -n "$existing_sha" ]; then
|
||||
payload=$(jq -n --arg m "$message" --arg c "$b64" --arg b "$attempt_branch" --arg sha "$existing_sha" \
|
||||
'{message:$m, content:$c, branch:$b, sha:$sha}')
|
||||
else
|
||||
payload=$(jq -n --arg m "$message" --arg c "$b64" --arg b "$attempt_branch" \
|
||||
'{message:$m, content:$c, branch:$b}')
|
||||
fi
|
||||
|
||||
resp=$(gh_api_retry PUT "/repos/$FORK_OWNER/$UP_NAME/contents/$dst" "$payload")
|
||||
http=$(gh_http "$resp")
|
||||
[ "$http" = "200" ] || [ "$http" = "201" ] || die "PUT $dst failed http=$http body=$(gh_body "$resp")"
|
||||
}
|
||||
|
||||
commit_msg_prefix="Add Dockerfile"
|
||||
[ "$dockerfile_state" = "fixed-broken-upstream" ] && commit_msg_prefix="Fix Dockerfile"
|
||||
|
||||
log "committing Dockerfile, .dockerignore, BUILD.md"
|
||||
put_file "$DOCKERFILE_SRC" "Dockerfile" "$commit_msg_prefix
|
||||
|
||||
Signed-off-by: Viktor Barzin <viktorbarzin@meta.com>"
|
||||
put_file "$DOCKERIGNORE_SRC" ".dockerignore" "Add .dockerignore
|
||||
|
||||
Signed-off-by: Viktor Barzin <viktorbarzin@meta.com>"
|
||||
put_file "$BUILDMD_SRC" "BUILD.md" "Add BUILD.md
|
||||
|
||||
Signed-off-by: Viktor Barzin <viktorbarzin@meta.com>"
|
||||
|
||||
# --- Render PR body ---
|
||||
reason_paragraph="This project currently has no Dockerfile, making it harder for the self-hosting community to run this. I put together a working one while deploying this app to my home Kubernetes cluster and wanted to upstream it."
|
||||
if [ "$reason_type" = "broken" ]; then
|
||||
reason_paragraph="The existing Dockerfile in this repo does not build cleanly for \`linux/amd64\`. I tracked down the fixes while deploying this app to my home Kubernetes cluster and wanted to upstream them."
|
||||
fi
|
||||
|
||||
IMAGE_SIZE=$(jq -r '.image_size // "unknown"' "$STATE_FILE")
|
||||
BASE_IMAGE=$(jq -r '.base_image // "unknown"' "$STATE_FILE")
|
||||
IMAGE_TAG=$(jq -r '.image_tag // "myapp:latest"' "$STATE_FILE")
|
||||
DOCKERFILE_SHAPE=$(jq -r '.dockerfile_shape // "multi-stage, non-root, linux/amd64"' "$STATE_FILE")
|
||||
|
||||
pr_body=$(cat "$TEMPLATES_DIR/PR_BODY.md")
|
||||
pr_body="${pr_body//\{\{REASON_PARAGRAPH\}\}/$reason_paragraph}"
|
||||
pr_body="${pr_body//\{\{DOCKERFILE_SHAPE\}\}/$DOCKERFILE_SHAPE}"
|
||||
pr_body="${pr_body//\{\{IMAGE_SIZE\}\}/$IMAGE_SIZE}"
|
||||
pr_body="${pr_body//\{\{BASE_IMAGE\}\}/$BASE_IMAGE}"
|
||||
pr_body="${pr_body//\{\{IMAGE_TAG\}\}/$IMAGE_TAG}"
|
||||
|
||||
pr_title="$commit_msg_prefix"
|
||||
|
||||
# --- Open PR ---
|
||||
log "opening PR against $UP_OWNER/$UP_NAME:$default_branch"
|
||||
payload=$(jq -n \
|
||||
--arg t "$pr_title" \
|
||||
--arg h "${FORK_OWNER}:${attempt_branch}" \
|
||||
--arg b "$default_branch" \
|
||||
--arg body "$pr_body" \
|
||||
'{title:$t, head:$h, base:$b, body:$body, maintainer_can_modify:true}')
|
||||
resp=$(gh_api_retry POST "/repos/$UP_OWNER/$UP_NAME/pulls" "$payload")
|
||||
http=$(gh_http "$resp"); body=$(gh_body "$resp")
|
||||
if [ "$http" != "201" ]; then
|
||||
die "PR creation failed http=$http body=$body"
|
||||
fi
|
||||
|
||||
pr_url=$(printf '%s' "$body" | jq -r '.html_url')
|
||||
log "PR opened: $pr_url"
|
||||
|
||||
# --- Record PR URL in state file ---
|
||||
jq --arg url "$pr_url" '.contribution_pr_url = $url' "$STATE_FILE" > "$STATE_FILE.tmp" && mv "$STATE_FILE.tmp" "$STATE_FILE"
|
||||
log "state file updated with PR URL"
|
||||
|
|
@ -1,71 +0,0 @@
|
|||
#!/usr/bin/env bash
|
||||
# 10-minute deploy stability gate for setup-project skill.
|
||||
# Polls pod readiness + HTTP 200 on target URL every 30s for 20 iterations.
|
||||
# Requires 18/20 probes to succeed (tolerates 2 blips for restarts/DNS propagation).
|
||||
#
|
||||
# Usage:
|
||||
# stability-gate.sh <namespace> <app-label> <url>
|
||||
#
|
||||
# Example:
|
||||
# stability-gate.sh myapp myapp https://myapp.viktorbarzin.me
|
||||
#
|
||||
# Exit codes:
|
||||
# 0 - Stable (>=18/20 probes OK)
|
||||
# 1 - Unstable (<18/20 probes OK)
|
||||
# 2 - Usage error
|
||||
|
||||
set -u
|
||||
|
||||
if [ "$#" -ne 3 ]; then
|
||||
echo "Usage: $0 <namespace> <app-label> <url>" >&2
|
||||
exit 2
|
||||
fi
|
||||
|
||||
NS="$1"
|
||||
APP="$2"
|
||||
URL="$3"
|
||||
|
||||
TOTAL_PROBES=20
|
||||
MIN_SUCCESSES=18
|
||||
INTERVAL_SECONDS=30
|
||||
|
||||
ok_count=0
|
||||
fail_count=0
|
||||
|
||||
echo "stability-gate: ns=$NS app=$APP url=$URL"
|
||||
echo "stability-gate: $TOTAL_PROBES probes x ${INTERVAL_SECONDS}s (need $MIN_SUCCESSES/$TOTAL_PROBES)"
|
||||
|
||||
for i in $(seq 1 "$TOTAL_PROBES"); do
|
||||
probe_ok=true
|
||||
|
||||
if ! kubectl wait --for=condition=Ready pod -l "app=$APP" -n "$NS" --timeout=25s >/dev/null 2>&1; then
|
||||
probe_ok=false
|
||||
fi
|
||||
|
||||
status=$(curl -sS -o /dev/null -w "%{http_code}" --max-time 10 "$URL" || echo "000")
|
||||
if [ "$status" != "200" ]; then
|
||||
probe_ok=false
|
||||
fi
|
||||
|
||||
if [ "$probe_ok" = "true" ]; then
|
||||
ok_count=$((ok_count + 1))
|
||||
printf " probe %2d/%d: OK (http=%s)\n" "$i" "$TOTAL_PROBES" "$status"
|
||||
else
|
||||
fail_count=$((fail_count + 1))
|
||||
printf " probe %2d/%d: FAIL (http=%s)\n" "$i" "$TOTAL_PROBES" "$status"
|
||||
fi
|
||||
|
||||
if [ "$i" -lt "$TOTAL_PROBES" ]; then
|
||||
sleep "$INTERVAL_SECONDS"
|
||||
fi
|
||||
done
|
||||
|
||||
echo "stability-gate: results ok=$ok_count fail=$fail_count"
|
||||
|
||||
if [ "$ok_count" -ge "$MIN_SUCCESSES" ]; then
|
||||
echo "stability-gate: PASS"
|
||||
exit 0
|
||||
fi
|
||||
|
||||
echo "stability-gate: FAIL (need $MIN_SUCCESSES, got $ok_count)" >&2
|
||||
exit 1
|
||||
|
|
@ -1,24 +0,0 @@
|
|||
# Build notes
|
||||
|
||||
## Build
|
||||
|
||||
```
|
||||
docker build --platform linux/amd64 -t {{IMAGE_NAME}}:{{TAG}} .
|
||||
```
|
||||
|
||||
## Run
|
||||
|
||||
```
|
||||
docker run --rm -p {{CONTAINER_PORT}}:{{CONTAINER_PORT}} {{IMAGE_NAME}}:{{TAG}}
|
||||
```
|
||||
|
||||
## Configuration
|
||||
|
||||
{{ENV_VARS_TABLE}}
|
||||
|
||||
## Notes
|
||||
|
||||
- Built for `linux/amd64`; multi-arch not tested.
|
||||
- Image size: `{{IMAGE_SIZE}}`, base: `{{BASE_IMAGE}}`.
|
||||
- Runs as a non-root user.
|
||||
{{EXTRA_NOTES}}
|
||||
|
|
@ -1,25 +0,0 @@
|
|||
## Add a working Dockerfile
|
||||
|
||||
### Why
|
||||
{{REASON_PARAGRAPH}}
|
||||
|
||||
### What this adds
|
||||
- `Dockerfile` — {{DOCKERFILE_SHAPE}}
|
||||
- `.dockerignore`
|
||||
- `BUILD.md` with the build command and notes
|
||||
|
||||
### Tested
|
||||
- Built and pushed to a private registry, deployed to a Kubernetes cluster.
|
||||
- Pod has been Ready and serving HTTP 200 at the ingress for 10+ minutes of continuous probing before this PR was opened.
|
||||
- Image size: {{IMAGE_SIZE}}, base: {{BASE_IMAGE}}
|
||||
- Platform tested: `linux/amd64`
|
||||
|
||||
### Build command
|
||||
```
|
||||
docker build --platform linux/amd64 -t {{IMAGE_TAG}} .
|
||||
```
|
||||
|
||||
Happy to iterate on base image, build args, or multi-arch support if you'd prefer a different shape. Thanks for the project!
|
||||
|
||||
---
|
||||
<sub>Contributed after self-hosting this project. Filed by the repo owner's deployment workflow; feel free to mention me (@ViktorBarzin) with any follow-ups.</sub>
|
||||
|
|
@ -1,199 +0,0 @@
|
|||
---
|
||||
name: upgrade-state
|
||||
description: |
|
||||
Audit the three autonomous-upgrade pipelines (apps via Keel, OS via
|
||||
unattended-upgrades+kured, K8s components via the version-check chain).
|
||||
Use when:
|
||||
(1) User asks "/upgrade-state" or "are we current",
|
||||
(2) User asks "what's pending upgrade" or "what's the upgrade state",
|
||||
(3) User asks if Keel / kured / k8s-version-check is healthy,
|
||||
(4) User asks about kept-back / held packages or pending reboots,
|
||||
(5) Periodic survey before the next `k8s-version-check` daily run.
|
||||
Read-only — no `--fix`. Exits 0 healthy / 1 attention / 2 stalled.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-05-18
|
||||
---
|
||||
|
||||
# Upgrade-state
|
||||
|
||||
## MANDATORY: Run the script first
|
||||
|
||||
When this skill is invoked, your **first action** must be to run
|
||||
`upgrade_state.sh` and reason over its output before doing anything
|
||||
else. Do NOT improvise individual `kubectl` / `ssh` calls — the script
|
||||
is the authoritative surface.
|
||||
|
||||
```bash
|
||||
bash /home/wizard/code/infra/scripts/upgrade_state.sh
|
||||
```
|
||||
|
||||
For programmatic use:
|
||||
|
||||
```bash
|
||||
bash /home/wizard/code/infra/scripts/upgrade_state.sh --json | tee /tmp/upgrade-state.json
|
||||
```
|
||||
|
||||
Then:
|
||||
|
||||
1. Report the rendered table verbatim — it answers the user's
|
||||
"are we current" question in three lines.
|
||||
2. For every `⚠` or `✗` row, surface the relevant drill-down lines
|
||||
underneath and propose a next action (links in the table below).
|
||||
3. Only reach for ad-hoc commands when investigating beyond what the
|
||||
script reported.
|
||||
|
||||
Exit codes: `0` healthy, `1` attention warranted, `2` stalled / broken.
|
||||
|
||||
## What it covers (3 pipelines)
|
||||
|
||||
| Layer | What runs | Cadence | Data sources |
|
||||
|---|---|---|---|
|
||||
| **Apps** | Keel polls every watched Deployment's container registry; rolls on new digest | hourly | Prom (`pending_approvals`, `registries_scanned_total`), Keel pod logs |
|
||||
| **OS** | `unattended-upgrades` in-release patching; `kured` reboots when `/var/run/reboot-required` is set | daily 02:00-06:00 London | SSH fan-out to all 5 nodes |
|
||||
| **K8s** | `k8s-version-check` CronJob detects new kubeadm patch/minor; spawns the Job-chain that drains+upgrades node-by-node | daily 12:00 UTC | Pushgateway (`k8s_upgrade_*`), `kubectl get nodes` |
|
||||
|
||||
The K8s pipeline pushes a small set of gauges to the Prometheus
|
||||
Pushgateway (`prometheus-prometheus-pushgateway.monitoring:9091`):
|
||||
|
||||
- `k8s_upgrade_available{kind="patch"|"minor",target=…}` — 1 if newer release detected
|
||||
- `k8s_version_check_last_run_timestamp` — when detection last ran
|
||||
- `k8s_upgrade_in_flight` — 0/1
|
||||
- `k8s_upgrade_started_timestamp` — when the current chain started (0 when idle)
|
||||
|
||||
`K8sUpgradeStalled` alert fires when `in_flight=1` and the chain has
|
||||
been running >90 minutes. The script raises `✗` in the same window.
|
||||
|
||||
## Status-icon legend
|
||||
|
||||
| Icon | Meaning |
|
||||
|---|---|
|
||||
| `✓` | Healthy, fully current |
|
||||
| `→` | Update available, not yet applied (K8s patch/minor) |
|
||||
| `…` | In flight — chain currently running |
|
||||
| `⚠` | Attention: held-with-bumps, recent errors, pending approvals |
|
||||
| `✗` | Broken: pod down, alert firing, chain stalled |
|
||||
|
||||
## Drill-down — when a row trips, what to do
|
||||
|
||||
### Apps `⚠` — pending approvals or errors
|
||||
|
||||
```bash
|
||||
# Read recent Keel log lines
|
||||
kubectl -n keel logs deploy/keel --since=24h --tail=200
|
||||
|
||||
# What is Keel currently tracking?
|
||||
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
|
||||
wget -qO- 'http://localhost:9090/api/v1/query?query=count by (image) (registries_scanned_total)'
|
||||
|
||||
# Is the scrape live?
|
||||
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
|
||||
wget -qO- 'http://localhost:9090/api/v1/query?query=up{job="kubernetes-pods",app="keel"}'
|
||||
```
|
||||
|
||||
Common Keel errors:
|
||||
- `failed to add image watch job` — image annotation mistyped (rare; Kyverno auto-injects)
|
||||
- `registry authentication required` — bad imagePullSecret on the watched Deployment
|
||||
- `bad tag pattern` — Keel can't parse the watched image's tag against its policy
|
||||
|
||||
### OS `⚠` — held packages with bumps
|
||||
|
||||
The script flags any package held via `apt-mark hold` that ALSO appears
|
||||
in `apt list --upgradable` — excluding k8s components (the K8s pipeline
|
||||
owns those) and the kernel (kured handles the reboot half).
|
||||
|
||||
Typical cause: a major-version bump (e.g. containerd 1.7 → 2.2,
|
||||
runc 1.1 → 1.4). These are held because they need cluster-wide
|
||||
coordination, not silent in-release patching.
|
||||
|
||||
```bash
|
||||
# Inspect the situation on the flagged node
|
||||
ssh wizard@10.0.20.10X 'apt-mark showhold; apt list --upgradable 2>/dev/null'
|
||||
|
||||
# Unhold + upgrade a specific package
|
||||
ssh wizard@10.0.20.10X 'sudo apt-mark unhold containerd && sudo apt-get install -y containerd'
|
||||
```
|
||||
|
||||
Node IPs: master=`100`, node1=`101`, node2=`102`, node3=`103`, node4=`104`.
|
||||
|
||||
### OS `⚠` — pending reboot
|
||||
|
||||
A node has `/var/run/reboot-required`. Kured will reboot it inside the
|
||||
next 02:00-06:00 London window (any day of the week).
|
||||
|
||||
```bash
|
||||
# Force a manual reboot inside the window (rare)
|
||||
kubectl drain k8s-nodeX --delete-emptydir-data --ignore-daemonsets
|
||||
ssh wizard@10.0.20.10X sudo systemctl reboot
|
||||
```
|
||||
|
||||
### OS `✗` — kured not Running
|
||||
|
||||
```bash
|
||||
kubectl -n kured get pods
|
||||
kubectl -n kured logs daemonset/kured --tail=100
|
||||
# Verify sentinel gate (kured-sentinel-gate DaemonSet writes /var/run/gated-reboot-required)
|
||||
kubectl -n kured get pods -l name=kured-sentinel-gate
|
||||
```
|
||||
|
||||
### K8s `→` — patch/minor available
|
||||
|
||||
Detection ran, target identified, chain NOT started. The chain spawns
|
||||
on the same daily detection cycle — typically within ~24h of the
|
||||
target first being detected.
|
||||
|
||||
```bash
|
||||
# Inspect Pushgateway state
|
||||
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
|
||||
wget -qO- 'http://prometheus-prometheus-pushgateway:9091/metrics' | grep ^k8s_upgrade
|
||||
|
||||
# Trigger a manual run of the detection CronJob
|
||||
kubectl -n k8s-upgrade create job --from=cronjob/k8s-version-check manual-detect-$(date +%s)
|
||||
```
|
||||
|
||||
### K8s `…` — in flight
|
||||
|
||||
The Job chain is running. Watch its progress:
|
||||
|
||||
```bash
|
||||
kubectl -n k8s-upgrade get jobs --sort-by=.metadata.creationTimestamp
|
||||
kubectl -n k8s-upgrade logs -l app=k8s-version-upgrade --tail=200 --prefix
|
||||
```
|
||||
|
||||
### K8s `✗ stalled` — `K8sUpgradeStalled` would fire
|
||||
|
||||
Chain in-flight >90m. The Job is most likely stuck on drain or a
|
||||
pre-flight check.
|
||||
|
||||
```bash
|
||||
kubectl -n k8s-upgrade get jobs
|
||||
kubectl -n k8s-upgrade describe job <stuck-job>
|
||||
kubectl -n k8s-upgrade logs job/<stuck-job> --tail=300
|
||||
|
||||
# If you need to clear the in-flight flag (after diagnosing):
|
||||
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- sh -c \
|
||||
"printf 'k8s_upgrade_in_flight 0\nk8s_upgrade_started_timestamp 0\n' | \
|
||||
wget -qO- --post-file=- 'http://prometheus-prometheus-pushgateway:9091/metrics/job/k8s-version-upgrade' \
|
||||
--header='Content-Type: text/plain'"
|
||||
```
|
||||
|
||||
### K8s `✗ detection stale` — last detection >9 days
|
||||
|
||||
```bash
|
||||
kubectl -n k8s-upgrade get cronjob k8s-version-check
|
||||
kubectl -n k8s-upgrade get jobs --sort-by=.metadata.creationTimestamp | tail -5
|
||||
```
|
||||
|
||||
If the CronJob hasn't fired on time, suspect:
|
||||
- `suspend=true` on the CronJob (`var.enabled=false` in the
|
||||
`k8s-version-upgrade` Terraform stack)
|
||||
- Image-pull failure on the version-check pod
|
||||
- Pushgateway scrape gone stale
|
||||
|
||||
## Companion command-line flags
|
||||
|
||||
```bash
|
||||
bash infra/scripts/upgrade_state.sh # rendered table (default)
|
||||
bash infra/scripts/upgrade_state.sh --json # machine output
|
||||
bash infra/scripts/upgrade_state.sh --kubeconfig X # override kubeconfig
|
||||
```
|
||||
|
|
@ -1,173 +0,0 @@
|
|||
---
|
||||
name: uptime-kuma
|
||||
description: |
|
||||
Manage Uptime Kuma monitoring via the Python API. Use when:
|
||||
(1) User asks to add, remove, or list monitors,
|
||||
(2) User asks about service uptime or monitoring status,
|
||||
(3) User asks to check what's being monitored,
|
||||
(4) User deploys a new service and needs monitoring added,
|
||||
(5) User mentions "uptime", "monitoring", "health check", or "uptime kuma".
|
||||
Uptime Kuma v2 running in Kubernetes, managed via uptime-kuma-api Python library.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-14
|
||||
---
|
||||
|
||||
# Uptime Kuma Monitoring Management
|
||||
|
||||
## Overview
|
||||
- **URL**: `https://uptime.viktorbarzin.me`
|
||||
- **Internal**: `uptime-kuma.uptime-kuma.svc.cluster.local:80`
|
||||
- **Image**: `louislam/uptime-kuma:2`
|
||||
- **Storage**: NFS at `/mnt/main/uptime-kuma` -> `/app/data`
|
||||
- **API Library**: `uptime-kuma-api` (pip, available via PYTHONPATH)
|
||||
- **Credentials**: admin / (from `UPTIME_KUMA_PASSWORD` env var)
|
||||
|
||||
## Python API Access
|
||||
|
||||
### Connection Pattern
|
||||
```python
|
||||
import os
|
||||
from uptime_kuma_api import UptimeKumaApi, MonitorType
|
||||
|
||||
api = UptimeKumaApi('https://uptime.viktorbarzin.me')
|
||||
api.login('admin', os.environ.get('UPTIME_KUMA_PASSWORD', ''))
|
||||
|
||||
# ... operations ...
|
||||
|
||||
api.disconnect()
|
||||
```
|
||||
|
||||
### Execution
|
||||
```bash
|
||||
python3 -c "
|
||||
import os
|
||||
from uptime_kuma_api import UptimeKumaApi, MonitorType
|
||||
api = UptimeKumaApi('https://uptime.viktorbarzin.me')
|
||||
api.login('admin', os.environ.get('UPTIME_KUMA_PASSWORD', ''))
|
||||
# ... your code ...
|
||||
api.disconnect()
|
||||
"
|
||||
```
|
||||
|
||||
### Common Operations
|
||||
|
||||
#### List All Monitors
|
||||
```python
|
||||
monitors = api.get_monitors()
|
||||
for m in monitors:
|
||||
print(f'{m["id"]:3d} | {m["name"]:30s} | {m["type"]:15s} | interval={m["interval"]}s')
|
||||
```
|
||||
|
||||
#### Add HTTP Monitor
|
||||
```python
|
||||
api.add_monitor(
|
||||
type=MonitorType.HTTP,
|
||||
name="Service Name",
|
||||
url="http://service.namespace.svc.cluster.local",
|
||||
interval=120,
|
||||
maxretries=2,
|
||||
)
|
||||
```
|
||||
|
||||
#### Add PING Monitor
|
||||
```python
|
||||
api.add_monitor(
|
||||
type=MonitorType.PING,
|
||||
name="Host Name",
|
||||
hostname="10.0.20.1",
|
||||
interval=30,
|
||||
maxretries=3,
|
||||
)
|
||||
```
|
||||
|
||||
#### Add PORT Monitor
|
||||
```python
|
||||
api.add_monitor(
|
||||
type=MonitorType.PORT,
|
||||
name="Service Port",
|
||||
hostname="service.namespace.svc.cluster.local",
|
||||
port=8080,
|
||||
interval=120,
|
||||
maxretries=2,
|
||||
)
|
||||
```
|
||||
|
||||
#### Edit Monitor
|
||||
```python
|
||||
api.edit_monitor(monitor_id, interval=120, maxretries=2)
|
||||
```
|
||||
|
||||
#### Delete Monitor
|
||||
```python
|
||||
api.delete_monitor(monitor_id)
|
||||
```
|
||||
|
||||
#### Pause/Resume Monitor
|
||||
```python
|
||||
api.pause_monitor(monitor_id)
|
||||
api.resume_monitor(monitor_id)
|
||||
```
|
||||
|
||||
## Monitor Types
|
||||
- `MonitorType.HTTP` — HTTP(S) endpoint check
|
||||
- `MonitorType.PING` — ICMP ping
|
||||
- `MonitorType.PORT` — TCP port check
|
||||
- `MonitorType.POSTGRES` — PostgreSQL connection
|
||||
- `MonitorType.REDIS` — Redis connection
|
||||
- `MonitorType.DNS` — DNS resolution check
|
||||
|
||||
## Tiered Monitoring System
|
||||
|
||||
Monitors use tiered intervals to balance responsiveness with resource usage:
|
||||
|
||||
| Tier | Interval | Retries | Use For |
|
||||
|------|----------|---------|---------|
|
||||
| **1 - Critical** | 30s | 3 | Core infra (DNS, gateway, ingress, NFS, K8s API, auth, mail) |
|
||||
| **2 - Important** | 120s | 2 | Actively used services (Nextcloud, Immich, Vaultwarden, etc.) |
|
||||
| **3 - Standard** | 300s | 1 | Auxiliary/optional services (blog, games, tools) |
|
||||
|
||||
### Tier Assignment Guidelines
|
||||
- **Tier 1**: If it goes down, multiple other services fail or the cluster is unreachable
|
||||
- **Tier 2**: User-facing services that are actively used daily
|
||||
- **Tier 3**: Nice-to-have services, tools, dashboards
|
||||
|
||||
### When Adding a New Service
|
||||
Match the tier to the service's DEFCON level from CLAUDE.md:
|
||||
- DEFCON 1-2 → Tier 1 (30s)
|
||||
- DEFCON 3-4 → Tier 2 (120s)
|
||||
- DEFCON 5 → Tier 3 (300s)
|
||||
|
||||
## Internal Service URL Pattern
|
||||
Most K8s services follow: `http://<service-name>.<namespace>.svc.cluster.local:<port>`
|
||||
|
||||
Common port is 80. Exceptions:
|
||||
- Homepage: port 3000
|
||||
- Ollama: port 11434
|
||||
- Loki: port 3100 (use `/ready` endpoint)
|
||||
- Traefik dashboard: port 8080 (use `/dashboard/` path)
|
||||
- K8s API: `https://10.0.20.100:6443`
|
||||
- Immich: port 2283 (use `/api/server/ping`)
|
||||
|
||||
## Notes
|
||||
1. Uptime Kuma uses Socket.IO (WebSocket) for its API, not REST
|
||||
2. The `uptime-kuma-api` Python library wraps Socket.IO
|
||||
3. Add `time.sleep(0.3)` between bulk operations to avoid overloading
|
||||
4. Homepage dashboard widget slug: `cluster-internal`
|
||||
5. Cloudflare-proxied at `uptime.viktorbarzin.me`
|
||||
|
||||
## Terraform-Managed Monitors
|
||||
|
||||
There is NO `louislam/uptime-kuma` Terraform provider. Two patterns exist for
|
||||
declarative monitor management in this stack:
|
||||
|
||||
- **External HTTPS monitors** — auto-discovered from ingress annotations by the
|
||||
`external-monitor-sync` CronJob (`*/10 * * * *`). Opt-out via
|
||||
`uptime.viktorbarzin.me/external-monitor: "false"` on the ingress.
|
||||
- **Internal monitors (DBs, non-HTTP)** — declared in the
|
||||
`local.internal_monitors` list in `stacks/uptime-kuma/modules/uptime-kuma/main.tf`
|
||||
and synced by the `internal-monitor-sync` CronJob. To add one, append to the
|
||||
list (provide `name`, `type`, `database_connection_string`,
|
||||
`database_password_vault_key`, `interval`, `retry_interval`, `max_retries`)
|
||||
and `scripts/tg apply`. The sync is idempotent — looks up by name, creates
|
||||
if missing, patches if drifted. Existing monitors keep their id and history.
|
||||
4
.git-crypt/.gitattributes
vendored
4
.git-crypt/.gitattributes
vendored
|
|
@ -1,4 +0,0 @@
|
|||
# Do not edit this file. To specify the files to encrypt, create your own
|
||||
# .gitattributes file in the directory where your files are.
|
||||
* !filter !diff
|
||||
*.gpg binary
|
||||
Binary file not shown.
Binary file not shown.
6
.gitattributes
vendored
6
.gitattributes
vendored
|
|
@ -1,6 +0,0 @@
|
|||
.gitattributes !filter !diff
|
||||
|
||||
*.tfstate filter=git-crypt diff=git-crypt
|
||||
*.tfvars filter=git-crypt diff=git-crypt
|
||||
secrets/** filter=git-crypt diff=git-crypt
|
||||
stacks/**/secrets/** filter=git-crypt diff=git-crypt
|
||||
5
.github/ISSUE_TEMPLATE/config.yml
vendored
5
.github/ISSUE_TEMPLATE/config.yml
vendored
|
|
@ -1,5 +0,0 @@
|
|||
blank_issues_enabled: true
|
||||
contact_links:
|
||||
- name: Service Status
|
||||
url: https://status.viktorbarzin.me
|
||||
about: Check current service status and active incidents
|
||||
Some files were not shown because too many files have changed in this diff Show more
Loading…
Add table
Add a link
Reference in a new issue