diff --git a/.claude/agents/issue-responder.md b/.claude/agents/issue-responder.md
new file mode 100644
index 00000000..41152d66
--- /dev/null
+++ b/.claude/agents/issue-responder.md
@@ -0,0 +1,180 @@
+---
+name: issue-responder
+description: "Automated infra team: reads GitHub Issues (incidents + feature requests), investigates, resolves if confident, escalates if complex."
+model: opus
+allowedTools:
+  - Read
+  - Edit
+  - Write
+  - Bash
+  - Grep
+  - Glob
+  - Agent
+---
+
+You are the automated infra team responder for ViktorBarzin/infra. You receive a GitHub Issue (incident report or feature request), investigate it, and take action.
+
+## Environment
+
+- **Infra repo**: `/home/wizard/code/infra`
+- **GitHub repo**: `ViktorBarzin/infra`
+- **GitHub PAT**: `vault kv get -field=github_pat secret/viktor`
+- **Cluster context script**: `/home/wizard/code/infra/.claude/scripts/sev-context.sh`
+- **Post-mortem agents**: `/home/wizard/code/infra/.claude/agents/post-mortem.md` (4-stage pipeline)
+- **Service catalog**: `/home/wizard/code/infra/.claude/reference/service-catalog.md`
+- **Terraform apply**: `cd /home/wizard/code/infra/stacks/<stack> && ../../scripts/tg apply --non-interactive`
+
+## Input
+
+You receive a prompt like:
+> Process GitHub Issue #N: <title>. Labels: <labels>. URL: <url>. Read the issue body via GitHub API, investigate, and take appropriate action.
+
+## Step 1: Read the Issue
+
+```bash
+GITHUB_TOKEN=$(vault kv get -field=github_pat secret/viktor)
+curl -s -H "Authorization: token $GITHUB_TOKEN" \
+  "https://api.github.com/repos/ViktorBarzin/infra/issues/<N>" | python3 -c "
+import sys, json
+d = json.load(sys.stdin)
+print(f'Title: {d[\"title\"]}')
+print(f'Author: {d[\"user\"][\"login\"]}')
+print(f'Labels: {[l[\"name\"] for l in d[\"labels\"]]}')
+print(f'State: {d[\"state\"]}')
+print(f'Body:\n{d[\"body\"]}')
+"
+```
+
+## Step 2: Classify and Route
+
+Based on labels:
+- `user-report` → **Incident Response** (Step 3A)
+- `feature-request` → **Feature Implementation** (Step 3B)
+- Neither → read the issue body, determine which it is, add the appropriate label, then route
+
+## Step 3A: Incident Response
+
+1. **Verify the issue is real**:
+   - Run `bash /home/wizard/code/infra/.claude/scripts/sev-context.sh` for cluster state
+   - Check whether the reported service is actually down: `kubectl get pods -n <namespace>`, check Uptime Kuma
+   - If the service appears healthy: comment "Service appears healthy from our monitoring. Could you provide more details or check again?" and close the issue
+
+2. **If the service is down**:
+   - Classify severity:
+     - **SEV1**: Node down, multiple services affected, data at risk, or complete outage of a core service (DNS, auth, ingress)
+     - **SEV2**: Single service down, degraded performance, or non-core service outage
+     - **SEV3**: Minor issue, cosmetic, or affecting only optional services
+   - Add labels: `incident` + `sev1`/`sev2`/`sev3` + `postmortem-required` (for SEV1/SEV2)
+   - Comment on the issue: "Investigating. Severity classified as SEV<N>."
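+
+   A hedged sketch of these two calls, using the same endpoints as Step 4; `<N>` and the SEV2 values are placeholders, not fixed choices:
+
+   ```bash
+   GITHUB_TOKEN=$(vault kv get -field=github_pat secret/viktor)
+   # Add incident + severity labels (SEV1/SEV2 also get postmortem-required)
+   curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" \
+     "https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/labels" \
+     -d '{"labels": ["incident", "sev2", "postmortem-required"]}'
+   # Acknowledge on the issue
+   curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" \
+     "https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/comments" \
+     -d '{"body": "Investigating. Severity classified as SEV2."}'
+   ```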
+
+3. **Attempt resolution** (if confident):
+   - Check pod logs, events, and recent deployments for obvious causes
+   - Common fixes you CAN do:
+     - Restart a stuck pod: `kubectl delete pod -n <ns> <pod>`
+     - Scale a deployment back up if it was scaled to 0
+     - Fix obvious Terraform config issues (wrong image tag, resource limits)
+     - Apply Terraform: `cd stacks/<stack> && ../../scripts/tg apply --non-interactive`
+   - If you fix it: comment with what was done and how it was resolved
+   - If you can't fix it or it's complex: escalate (see Step 4)
+
+4. **For SEV1/SEV2**: Spawn the post-mortem pipeline via the Agent tool:
+   ```
+   Agent(subagent_type="general-purpose", prompt="Run the post-mortem agent pipeline for issue #N...")
+   ```
+
+## Step 3B: Feature Implementation
+
+1. **Assess complexity**:
+   - Read the request carefully
+   - Check if it's a known pattern (deploy a service, add a monitor, config change)
+   - Check existing stacks in `stacks/` for similar services as reference
+
+2. **If trivial** (you're confident you can implement it correctly):
+   - Implement the change in Terraform
+   - **Always run `scripts/tg plan`** before apply — check for unexpected changes
+   - If the plan looks clean: apply via `scripts/tg apply --non-interactive`
+   - Commit: `git add <files> && git commit -m "feat: <description> (fixes #N)"`
+   - Push: `git push origin master`
+   - Comment on the issue with what was implemented
+   - Close the issue
+
+3. **If complex** (new architecture, unknown service, multi-stack changes, data migration):
+   - Comment with your assessment: what's needed, estimated complexity, any risks
+   - Escalate (see Step 4)
+
+## Step 4: Escalate
+
+When you can't confidently resolve an issue:
+
+```bash
+GITHUB_TOKEN=$(vault kv get -field=github_pat secret/viktor)
+
+# Add needs-human label
+curl -s -X POST \
+  -H "Authorization: token $GITHUB_TOKEN" \
+  "https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/labels" \
+  -d '{"labels": ["needs-human"]}'
+
+# Assign to Viktor
+curl -s -X POST \
+  -H "Authorization: token $GITHUB_TOKEN" \
+  "https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/assignees" \
+  -d '{"assignees": ["ViktorBarzin"]}'
+
+# Comment explaining why
+curl -s -X POST \
+  -H "Authorization: token $GITHUB_TOKEN" \
+  "https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/comments" \
+  -d "{\"body\": \"**Escalating to @ViktorBarzin** — <reason>\\n\\n**What I found:**\\n<findings>\\n\\n**Why I can't resolve this:**\\n<reason>\"}"
+```
+
+## Safety Rules
+
+1. **Never delete PVCs, PVs, or user data**
+2. **Never modify Vault secrets directly** — use Terraform + ExternalSecrets
+3. **Never force-push or git reset**
+4. **Never apply changes that could cause downtime to HEALTHY services**
+5. **Always `scripts/tg plan` before `scripts/tg apply`** — if the plan shows destroys > 0, ESCALATE (a sketch of this gate follows the list)
+6. **Never modify platform stacks** (vault, dbaas, traefik, authentik, kyverno) — ESCALATE these
+7. **All changes go through Terraform** — never kubectl apply/edit/patch as final state
+8. **Max budget**: $10 per issue. If you need more, escalate.
+9. **All commits reference the issue**: `fixes #N` or `ref #N`
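+
+Rule 5 can be checked mechanically. A minimal sketch, assuming `tg plan` surfaces Terraform's usual summary line (`Plan: X to add, Y to change, Z to destroy.`); the parsing is an assumption, not a guaranteed interface:
+
+```bash
+cd stacks/<stack>
+PLAN_OUT=$(../../scripts/tg plan 2>&1)
+# Pull the destroy count out of the plan summary line
+DESTROYS=$(printf '%s' "$PLAN_OUT" | grep -oE '[0-9]+ to destroy' | grep -oE '[0-9]+' | head -1)
+if [ "${DESTROYS:-0}" -gt 0 ]; then
+  echo "Plan shows $DESTROYS destroy(s); escalating (Step 4) instead of applying"
+else
+  ../../scripts/tg apply --non-interactive
+fi
+```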
+
+## Communication
+
+All updates go as GitHub Issue comments. Use this format:
+
+**Starting investigation:**
+> Investigating issue #N. Running cluster diagnostics...
+
+**Findings:**
+> **Findings:** <what you found>
+> - Pod `X` in namespace `Y` is in CrashLoopBackOff
+> - Last restart: 15 minutes ago
+> - Error in logs: `<error>`
+
+**Resolution:**
+> **Resolved:** <what was done>
+> - Restarted pod `X` — service recovered
+> - Root cause: OOM kill due to memory limit. Increased limit from 512Mi to 1Gi.
+> - Commit: `abc1234`
+
+**Escalation:**
+> **Escalating to @ViktorBarzin** — <brief reason>
+> **What I found:** <details>
+> **Why I can't resolve this:** <reason>
+
+## Commit Convention
+
+```
+feat: <description> (fixes #N)
+
+Co-Authored-By: issue-responder <noreply@anthropic.com>
+```
+
+Or for incident fixes:
+```
+fix: <description> (fixes #N)
+
+Co-Authored-By: issue-responder <noreply@anthropic.com>
+```
diff --git a/.claude/skills/cluster-health/SKILL.md b/.claude/skills/cluster-health/SKILL.md
index 7408bc1f..be18fc9f 100644
--- a/.claude/skills/cluster-health/SKILL.md
+++ b/.claude/skills/cluster-health/SKILL.md
@@ -295,6 +295,40 @@ The webhook URL is passed as an environment variable from `openclaw_skill_secret
 | kubectl (in pod) | `/tools/kubectl` |
 | terraform (in pod) | `/tools/terraform` |
 
+## Auto-File Incidents for SEV1/SEV2
+
+After running health checks, if **SEV1 or SEV2 issues** are found (node down, multiple services affected, core service outage, or a single important service down), auto-file a GitHub Issue:
+
+### Severity Classification
+- **SEV1**: Node NotReady, multiple services down, data at risk, core service outage (DNS, auth, ingress, databases)
+- **SEV2**: Single non-core service down, degraded performance, persistent CrashLoopBackOff
+- **SEV3**: Warnings only, resource pressure <90%, cosmetic — do NOT auto-file
+
+### Workflow
+1. **Dedup check**: Before filing, query open incidents:
+   ```bash
+   GITHUB_TOKEN=$(vault kv get -field=github_pat secret/viktor)
+   curl -s -H "Authorization: token $GITHUB_TOKEN" \
+     "https://api.github.com/repos/ViktorBarzin/infra/issues?labels=incident&state=open&per_page=50"
+   ```
+   If an open issue already covers the same service/namespace, **skip filing**.
+
+2. **File the issue** with labels `incident`, `sev1` or `sev2`, and `postmortem-required` (see the filing sketch after this list):
+   - Title: `[AUTO] <Service/Namespace> — <brief symptom>`
+   - Body: full diagnostic dump (pod status, events, alerts, node state)
+   - The issue-automation GHA workflow will trigger the post-mortem pipeline automatically
+
+3. **Auto-close recovered services**: If a service that previously had an auto-filed incident is now healthy:
+   ```bash
+   # Comment and close
+   curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" \
+     "https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/comments" \
+     -d '{"body": "**Resolved** — Service recovered. Auto-closed by cluster health check."}'
+   curl -s -X PATCH -H "Authorization: token $GITHUB_TOKEN" \
+     "https://api.github.com/repos/ViktorBarzin/infra/issues/<N>" \
+     -d '{"state": "closed"}'
+   ```
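+
+A hedged sketch of step 2's filing call; the service name, severity, and `/tmp/diag.txt` are illustrative placeholders:
+
+```bash
+# JSON-encode the diagnostic dump for the issue body
+BODY=$(python3 -c "import json; print(json.dumps(open('/tmp/diag.txt').read()))")
+curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" \
+  "https://api.github.com/repos/ViktorBarzin/infra/issues" \
+  -d "{\"title\": \"[AUTO] nextcloud — pods in CrashLoopBackOff\", \"body\": $BODY, \"labels\": [\"incident\", \"sev2\", \"postmortem-required\"]}"
+```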
+
 ## Post-Mortem Auto-Suggest
 
 After running a healthcheck, if the cluster has **recovered from an unhealthy state** (previous run showed FAIL items that are now resolved), suggest writing a post-mortem:
diff --git a/.github/ISSUE_TEMPLATE/feature-request.yml b/.github/ISSUE_TEMPLATE/feature-request.yml
new file mode 100644
index 00000000..a8934556
--- /dev/null
+++ b/.github/ISSUE_TEMPLATE/feature-request.yml
@@ -0,0 +1,12 @@
+name: Feature Request
+description: Request a new service, configuration change, or improvement
+labels: ["feature-request"]
+body:
+  - type: textarea
+    id: description
+    attributes:
+      label: What do you need?
+      description: Describe what you'd like. Be as specific as possible.
+      placeholder: "e.g., Deploy Obsidian for note-taking, or add a new Uptime Kuma monitor for service X"
+    validations:
+      required: true
diff --git a/.github/workflows/issue-automation.yml b/.github/workflows/issue-automation.yml
new file mode 100644
index 00000000..a3c65240
--- /dev/null
+++ b/.github/workflows/issue-automation.yml
@@ -0,0 +1,56 @@
+name: Issue Automation
+on:
+  issues:
+    types: [opened, labeled]
+
+jobs:
+  process-issue:
+    if: |
+      contains(github.event.issue.labels.*.name, 'user-report') ||
+      contains(github.event.issue.labels.*.name, 'feature-request')
+    runs-on: ubuntu-latest
+    steps:
+      - name: Check if author is collaborator
+        id: check-collab
+        env:
+          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+        run: |
+          RESPONSE=$(curl -s -o /dev/null -w "%{http_code}" \
+            -H "Authorization: token $GH_TOKEN" \
+            "https://api.github.com/repos/${{ github.repository }}/collaborators/${{ github.event.issue.user.login }}")
+          echo "is_collab=$([[ $RESPONSE == '204' ]] && echo 'true' || echo 'false')" >> $GITHUB_OUTPUT
+          echo "Author: ${{ github.event.issue.user.login }}, Collaborator: $RESPONSE"
+
+      - name: Queue for review (non-collaborator)
+        if: steps.check-collab.outputs.is_collab == 'false'
+        env:
+          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+        run: |
+          curl -s -X POST \
+            -H "Authorization: token $GH_TOKEN" \
+            "https://api.github.com/repos/${{ github.repository }}/issues/${{ github.event.issue.number }}/comments" \
+            -d '{"body": "Thanks for reporting! This has been queued for review by the infra team."}'
+          curl -s -X POST \
+            -H "Authorization: token $GH_TOKEN" \
+            "https://api.github.com/repos/${{ github.repository }}/issues/${{ github.event.issue.number }}/labels" \
+            -d '{"labels": ["needs-human"]}'
+
+      - name: Trigger Woodpecker pipeline (collaborator)
+        if: steps.check-collab.outputs.is_collab == 'true'
+        env:
+          # Route user-controlled strings through env vars to avoid shell injection
+          ISSUE_TITLE: ${{ github.event.issue.title }}
+          ISSUE_LABELS_JSON: ${{ toJSON(github.event.issue.labels.*.name) }}
+        run: |
+          # Extract labels as a comma-separated string
+          LABELS=$(printf '%s' "$ISSUE_LABELS_JSON" | python3 -c "import sys,json; print(','.join(json.load(sys.stdin)))" 2>/dev/null || echo "unknown")
+          # JSON-encode the title
+          TITLE_JSON=$(printf '%s' "$ISSUE_TITLE" | python3 -c 'import sys,json; print(json.dumps(sys.stdin.read().strip()))')
+
+          curl -sf -X POST \
+            -H "Authorization: Bearer ${{ secrets.WOODPECKER_TOKEN }}" \
+            "https://ci.viktorbarzin.me/api/repos/1/pipelines" \
+            -d "{
+              \"branch\": \"master\",
+              \"variables\": {
+                \"ISSUE_NUMBER\": \"${{ github.event.issue.number }}\",
+                \"ISSUE_TITLE\": $TITLE_JSON,
+                \"ISSUE_AUTHOR\": \"${{ github.event.issue.user.login }}\",
+                \"ISSUE_LABELS\": \"$LABELS\",
+                \"ISSUE_URL\": \"${{ github.event.issue.html_url }}\"
+              }
+            }"
diff --git a/.woodpecker/issue-automation.yml b/.woodpecker/issue-automation.yml
new file mode 100644
index 00000000..a4786a56
--- /dev/null
+++ b/.woodpecker/issue-automation.yml
@@ -0,0 +1,60 @@
+when:
+  event: manual
+
+clone:
+  git:
+    image: woodpeckerci/plugin-git
+    settings:
+      depth: 2
+
+steps:
+  - name: run-issue-responder
+    image: python:3.12-alpine
+    commands:
+      - apk add --no-cache openssh-client curl jq
+      # Authenticate to Vault via K8s SA JWT
+      - |
+        SA_TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
+        VAULT_RESP=$(curl -sf -X POST http://vault-active.vault.svc.cluster.local:8200/v1/auth/kubernetes/login \
+          -d "{\"role\":\"ci\",\"jwt\":\"$SA_TOKEN\"}")
+        VAULT_TOKEN=$(echo "$VAULT_RESP" | jq -r .auth.client_token)
+        if [ -z "$VAULT_TOKEN" ] || [ "$VAULT_TOKEN" = "null" ]; then
+          echo "ERROR: Vault authentication failed"
+          exit 1
+        fi
+        echo "Vault authenticated"
+      # Fetch DevVM SSH key
+      - |
+        curl -sf -H "X-Vault-Token: $VAULT_TOKEN" \
+          http://vault-active.vault.svc.cluster.local:8200/v1/secret/data/ci/infra | \
+          jq -r '.data.data.devvm_ssh_key' > /tmp/devvm-key
+        chmod 600 /tmp/devvm-key
+        if [ ! -s /tmp/devvm-key ]; then
+          echo "ERROR: Failed to fetch DevVM SSH key"
+          exit 1
+        fi
+        echo "SSH key fetched"
+      # SSH to DevVM and run the issue-responder agent
+      - |
+        ISSUE_NUM="${CI_PIPELINE_VARIABLE_ISSUE_NUMBER:-}"
+        ISSUE_TITLE="${CI_PIPELINE_VARIABLE_ISSUE_TITLE:-}"
+        ISSUE_LABELS="${CI_PIPELINE_VARIABLE_ISSUE_LABELS:-}"
+        ISSUE_URL="${CI_PIPELINE_VARIABLE_ISSUE_URL:-}"
+
+        if [ -z "$ISSUE_NUM" ]; then
+          echo "ERROR: No issue number provided"
+          exit 1
+        fi
+
+        echo "Processing issue #$ISSUE_NUM: $ISSUE_TITLE"
+        echo "Labels: $ISSUE_LABELS"
+
+        ssh -i /tmp/devvm-key -o StrictHostKeyChecking=no wizard@10.0.10.10 \
+          "cd ~/code/infra && git pull --rebase && \
+           ~/.local/bin/claude -p \
+             --agent .claude/agents/issue-responder \
+             --dangerously-skip-permissions \
+             --max-budget-usd 10 \
+             'Process GitHub Issue #${ISSUE_NUM}: ${ISSUE_TITLE}. Labels: ${ISSUE_LABELS}. URL: ${ISSUE_URL}. Read the issue body via GitHub API, investigate, and take appropriate action.'"
+      # Cleanup
+      - rm -f /tmp/devvm-key
diff --git a/docs/architecture/agent-task-tracking.md b/docs/architecture/agent-task-tracking.md
new file mode 100644
index 00000000..e89bee5e
--- /dev/null
+++ b/docs/architecture/agent-task-tracking.md
@@ -0,0 +1,151 @@
+# Agent Task Tracking
+
+## Overview
+
+All Claude Code sessions share a centralized task database powered by [Beads](https://github.com/steveyegge/beads) (`bd` CLI) and backed by a Dolt SQL server running in the Kubernetes cluster. This keeps agents from duplicating one another's work and provides persistent, cross-session task tracking.
+
+## Architecture
+
+```
+            ┌──────────────────────────┐
+            │  Dolt SQL Server (k8s)   │
+            │  beads-server namespace  │
+            │  10.0.20.200:3306        │
+            │  proxmox-lvm PVC (2Gi)   │
+            └────────────┬─────────────┘
+                         │ MySQL protocol
+          ┌──────────────┼──────────────────┐
+          │              │                  │
+┌─────────▼───┐  ┌───────▼────────┐   ┌─────▼──────────┐
+│ wizard      │  │ emo            │   │ future agents  │
+│ session 1   │  │ session 1      │   │ (any machine   │
+│ session 2   │  │ session 2      │   │  with network  │
+│ session N   │  │                │   │  access)       │
+└─────────────┘  └────────────────┘   └────────────────┘
+```
+
+### Components
+
+| Component | Location | Purpose |
+|-----------|----------|---------|
+| Dolt server | `beads-server` namespace, `10.0.20.200:3306` | Centralized MySQL-compatible database |
+| Root `.beads/` | `/home/wizard/code/.beads/` | Client config (server mode, prefix `code`) |
+| Task context hook | `/home/wizard/.claude/hooks/beads-task-context.sh` | Injects in-progress tasks into every prompt |
+| Task blocker hook | `/home/wizard/.claude/hooks/beads-block-builtin-tasks.py` | Blocks TaskCreate/TodoWrite, redirects to `bd` |
+| Project settings | `/home/wizard/code/.claude/settings.json` | Shared hooks (inherited by all users) |
+| Terraform stack | `stacks/beads-server/` | Deployment, Service (MetalLB LB), PVC |
+
+### Settings Hierarchy
+
+```
+Project-level (.claude/settings.json)        ← Shared: beads hooks + TaskCreate blocker
+  └─ User-level (~/.claude/settings.json)    ← Per-user: memory plugin, model, statusline
+```
+
+Both `wizard` and `emo` inherit the project-level settings automatically. User-specific hooks (e.g., wizard's memory plugin) stay in the user-level settings.
+
+## Agent Session Lifecycle
+
+### 1. Session Start (automatic)
+
+The `UserPromptSubmit` hook fires on every prompt:
+- Queries `bd list --status in_progress` from the centralized DB
+- Queries `bd list --status open | head -10` for available work
+- Injects the results into the agent's context as `additionalContext`
+
+The agent sees what's currently being worked on before processing any request.
+
+### 2. Before Starting Work
+
+```bash
+bd list --status in_progress   # What others are working on
+bd ready                       # Unblocked tasks available
+bd create "Task description"   # Register your work
+bd update <id> --claim         # Set status to in_progress
+```
+
+### 3. During Work
+
+```bash
+bd note <id> "progress update"   # Log progress
+bd link <child> <parent>         # Add dependencies
+```
+
+### 4. After Completing Work
+
+```bash
+bd close <id>                # Mark complete
+bd create "Follow-up task"   # File remaining work for the next session
+```
+
+### 5. Enforcement
+
+Two layers prevent agents from using the built-in task tools:
+
+1. **CLAUDE.md instruction** (soft): "Do NOT use TaskCreate, TaskUpdate, TodoWrite"
+2. **PermissionRequest hook** (hard): blocks the tool call with a deny decision and a redirect message
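+
+A minimal sketch of what the blocker hook might look like; the exact hook I/O contract (tool call as JSON on stdin, decision JSON on stdout) is an assumption here, not verified against `beads-block-builtin-tasks.py`:
+
+```python
+#!/usr/bin/env python3
+import json, sys
+
+BLOCKED = {"TaskCreate", "TaskUpdate", "TodoWrite"}
+
+# Assumes the pending tool call arrives as JSON on stdin
+event = json.load(sys.stdin)
+
+if event.get("tool_name") in BLOCKED:
+    # Deny and point the agent at the shared beads DB instead
+    print(json.dumps({
+        "decision": "deny",
+        "reason": "Use bd create / bd update / bd close instead of built-in task tools",
+    }))
+    sys.exit(0)
+# Any other tool call passes through untouched
+```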
+
+## Infrastructure
+
+### Dolt Server
+
+- **Image**: `dolthub/dolt-sql-server:latest`
+- **Storage**: `proxmox-lvm` PVC, 2Gi initial, auto-resize to 10Gi
+- **Service**: LoadBalancer via MetalLB on shared IP `10.0.20.200`
+  - `metallb.io/allow-shared-ip: shared`
+  - `externalTrafficPolicy: Cluster`
+- **Port**: 3306 (MySQL protocol)
+- **Users**: `root@%` and `beads@%` (no password, internal network)
+- **Init**: `/docker-entrypoint-initdb.d/` via ConfigMap, `DOLT_ROOT_HOST=%`
+- **Terraform**: `stacks/beads-server/main.tf`
+
+### Client Configuration
+
+The root `.beads/metadata.json`:
+```json
+{
+  "backend": "dolt",
+  "dolt_mode": "server",
+  "dolt_server_host": "10.0.20.200",
+  "dolt_server_port": 3306,
+  "dolt_server_user": "beads",
+  "dolt_database": "code"
+}
+```
+
+### Multi-User Access
+
+- Directory permissions: `2770 wizard:code-shared` (setgid)
+- Both `wizard` and `emo` are in the `code-shared` group
+- `bd` binary: `/home/wizard/.local/bin/bd` (symlinked for emo at `/home/emo/.local/bin/bd`)
+
+## Known Issues
+
+### Subdirectory Shadow
+
+Per-project `.beads/` directories exist in 7 subdirectories (finance, infra, Website, etc.). When an agent `cd`s into one of these, `bd` auto-discovers the **local** `.beads/` instead of the centralized one.
+
+**Fix**: Always use `bd --db /home/wizard/code/.beads` when working from a subdirectory. The hook and the CLAUDE.md instructions document this.
+
+### Hook Network Failure
+
+The task context hook suppresses errors (`2>/dev/null`). If the Dolt server is unreachable, the hook silently exits without injecting context. Agents won't see current tasks, but they won't be blocked either.
+
+### Permissions Warning
+
+`bd` warns about the `.beads` directory permissions (`0770` vs the recommended `0700`). This is expected — we use `0770` for group access. The warning is harmless.
+
+## Verification
+
+Run the E2E test:
+```bash
+bash /home/wizard/code/test-beads-e2e.sh
+```
+
+It runs 11 phases covering hook injection, task CRUD, cross-user visibility, subdirectory shadowing, and multi-agent coordination; expect 11/11 PASS.
+
+## Related
+
+- `CLAUDE.md` (root) — mandatory task protocol section
+- Per-project `CLAUDE.md` files — beads integration block
+- `stacks/beads-server/main.tf` — Terraform deployment