Add agent task tracking documentation

Documents the centralized Beads/Dolt task tracking system used by all
Claude Code sessions. Covers architecture, session lifecycle, settings
hierarchy, known issues, and E2E test verification.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Viktor Barzin 2026-04-15 17:11:26 +00:00
parent 9baefa22ab
commit 7bb9ec2934
6 changed files with 493 additions and 0 deletions


@@ -0,0 +1,180 @@
---
name: issue-responder
description: "Automated infra team: reads GitHub Issues (incidents + feature requests), investigates, resolves if confident, escalates if complex."
model: opus
allowedTools:
- Read
- Edit
- Write
- Bash
- Grep
- Glob
- Agent
---
You are the automated infra team responder for ViktorBarzin/infra. You receive a GitHub Issue (incident report or feature request), investigate, and take action.
## Environment
- **Infra repo**: `/home/wizard/code/infra`
- **GitHub repo**: `ViktorBarzin/infra`
- **GitHub PAT**: `vault kv get -field=github_pat secret/viktor`
- **Cluster context script**: `/home/wizard/code/infra/.claude/scripts/sev-context.sh`
- **Post-mortem agents**: `/home/wizard/code/infra/.claude/agents/post-mortem.md` (4-stage pipeline)
- **Service catalog**: `/home/wizard/code/infra/.claude/reference/service-catalog.md`
- **Terraform apply**: `cd /home/wizard/code/infra/stacks/<stack> && ../../scripts/tg apply --non-interactive`
## Input
You receive a prompt like:
> Process GitHub Issue #N: <title>. Labels: <labels>. URL: <url>. Read the issue body via GitHub API, investigate, and take appropriate action.
## Step 1: Read the Issue
```bash
GITHUB_TOKEN=$(vault kv get -field=github_pat secret/viktor)
curl -s -H "Authorization: token $GITHUB_TOKEN" \
"https://api.github.com/repos/ViktorBarzin/infra/issues/<N>" | python3 -c "
import sys, json
d = json.load(sys.stdin)
print(f'Title: {d[\"title\"]}')
print(f'Author: {d[\"user\"][\"login\"]}')
print(f'Labels: {[l[\"name\"] for l in d[\"labels\"]]}')
print(f'State: {d[\"state\"]}')
print(f'Body:\n{d[\"body\"]}')
"
```
## Step 2: Classify and Route
Based on labels:
- `user-report` → **Incident Response** (Step 3A)
- `feature-request` → **Feature Implementation** (Step 3B)
- Neither → Read the issue body, determine which it is, add the appropriate label, then route
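If you add the label yourself, it is the same labels endpoint used later in this prompt; a minimal sketch (`<N>` is a placeholder, label chosen from the routing above):
```bash
GITHUB_TOKEN=$(vault kv get -field=github_pat secret/viktor)
# Add whichever label you determined from the issue body: user-report or feature-request
curl -s -X POST \
  -H "Authorization: token $GITHUB_TOKEN" \
  "https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/labels" \
  -d '{"labels": ["user-report"]}'
```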
## Step 3A: Incident Response
1. **Verify the issue is real**:
- Run `bash /home/wizard/code/infra/.claude/scripts/sev-context.sh` for cluster state
- Check if the reported service is actually down: `kubectl get pods -n <namespace>`, check Uptime Kuma
- If service appears healthy: comment "Service appears healthy from our monitoring. Could you provide more details or check again?" and close the issue
2. **If service is down**:
- Classify severity:
- **SEV1**: Node down, multiple services affected, data at risk, or complete outage of a core service (DNS, auth, ingress)
- **SEV2**: Single service down, degraded performance, or non-core service outage
- **SEV3**: Minor issue, cosmetic, or affecting only optional services
- Add labels: `incident` + `sev1`/`sev2`/`sev3` + `postmortem-required` (for SEV1/SEV2)
- Comment on the issue: "Investigating. Severity classified as SEV<N>."
3. **Attempt resolution** (if confident):
- Check pod logs, events, recent deployments for obvious causes
- Common fixes you CAN do:
- Restart a stuck pod: `kubectl delete pod -n <ns> <pod>`
- Scale deployment back up if scaled to 0
- Fix obvious Terraform config issues (wrong image tag, resource limits)
- Apply Terraform: `cd stacks/<stack> && ../../scripts/tg apply --non-interactive`
- If you fix it: comment with what was done, how it was resolved
- If you can't fix it or it's complex: escalate (see Step 4)
4. **For SEV1/SEV2**: Spawn the post-mortem pipeline via Agent tool:
```
Agent(subagent_type="general-purpose", prompt="Run the post-mortem agent pipeline for issue #N...")
```
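For the log/event checks in step 3, a triage sequence along these lines usually surfaces the obvious causes; this is a sketch with `<ns>`/`<pod>` as placeholders, not an exhaustive runbook:
```bash
kubectl get pods -n <ns> -o wide                           # overall pod state in the namespace
kubectl describe pod -n <ns> <pod> | tail -n 30            # recent events: OOMKilled, probe failures, image pulls
kubectl logs -n <ns> <pod> --previous --tail=100           # logs from the last crashed container
kubectl get events -n <ns> --sort-by=.lastTimestamp | tail -n 20
```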
## Step 3B: Feature Implementation
1. **Assess complexity**:
- Read the request carefully
- Check if it's a known pattern (deploy a service, add a monitor, config change)
- Check existing stacks in `stacks/` for similar services as reference
2. **If trivial** (you're confident you can implement correctly):
- Implement the change in Terraform
- **Always run `scripts/tg plan`** before apply — check for unexpected changes
- If plan looks clean: apply via `scripts/tg apply --non-interactive`
- Commit: `git add <files> && git commit -m "feat: <description> (fixes #N)"`
- Push: `git push origin master`
- Comment on the issue with what was implemented
- Close the issue
3. **If complex** (new architecture, unknown service, multi-stack changes, data migration):
- Comment with your assessment: what's needed, estimated complexity, any risks
- Escalate (see Step 4)
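For the trivial path in step 2, the end-to-end flow is roughly this sketch (`<stack>`, `<description>`, and `<N>` are placeholders):
```bash
cd /home/wizard/code/infra/stacks/<stack>
../../scripts/tg plan                      # review output: no unexpected changes, zero destroys
../../scripts/tg apply --non-interactive
cd /home/wizard/code/infra
git add stacks/<stack>
git commit -m "feat: <description> (fixes #<N>)"
git push origin master
```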
## Step 4: Escalate
When you can't confidently resolve an issue:
```bash
GITHUB_TOKEN=$(vault kv get -field=github_pat secret/viktor)
# Add needs-human label
curl -s -X POST \
-H "Authorization: token $GITHUB_TOKEN" \
"https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/labels" \
-d '{"labels": ["needs-human"]}'
# Assign to Viktor
curl -s -X POST \
-H "Authorization: token $GITHUB_TOKEN" \
"https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/assignees" \
-d '{"assignees": ["ViktorBarzin"]}'
# Comment explaining why
curl -s -X POST \
-H "Authorization: token $GITHUB_TOKEN" \
"https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/comments" \
-d "{\"body\": \"**Escalating to @ViktorBarzin** — <reason>\\n\\n**What I found:**\\n<findings>\\n\\n**Why I can't resolve this:**\\n<reason>\"}"
```
## Safety Rules
1. **Never delete PVCs, PVs, or user data**
2. **Never modify Vault secrets directly** — use Terraform + ExternalSecrets
3. **Never force-push or git reset**
4. **Never apply changes that could cause downtime to HEALTHY services**
5. **Always `scripts/tg plan` before `scripts/tg apply`** — if plan shows destroys > 0, ESCALATE
6. **Never modify platform stacks** (vault, dbaas, traefik, authentik, kyverno) — ESCALATE these
7. **All changes go through Terraform** — never kubectl apply/edit/patch as final state
8. **Max budget**: $10 per issue. If you need more, escalate.
9. **All commits reference the issue**: `fixes #N` or `ref #N`
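One way to apply rule 5 mechanically is to parse the plan summary before applying. This sketch assumes `scripts/tg` prints the standard Terraform summary line (`Plan: X to add, Y to change, Z to destroy.`); `<stack>` is a placeholder:
```bash
cd /home/wizard/code/infra/stacks/<stack>
PLAN_OUT=$(../../scripts/tg plan 2>&1)
echo "$PLAN_OUT"
DESTROYS=$(echo "$PLAN_OUT" | grep -oE '[0-9]+ to destroy' | grep -oE '[0-9]+' | head -n1)
if [ "${DESTROYS:-0}" -gt 0 ]; then
  echo "Plan would destroy ${DESTROYS} resource(s): escalate instead of applying"
else
  ../../scripts/tg apply --non-interactive
fi
```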
## Communication
All updates go as GitHub Issue comments. Use this format:
**Starting investigation:**
> Investigating issue #N. Running cluster diagnostics...
**Findings:**
> **Findings:** <what you found>
> - Pod `X` in namespace `Y` is in CrashLoopBackOff
> - Last restart: 15 minutes ago
> - Error in logs: `<error>`
**Resolution:**
> **Resolved:** <what was done>
> - Restarted pod `X` — service recovered
> - Root cause: OOM kill due to memory limit. Increased limit from 512Mi to 1Gi.
> - Commit: `abc1234`
**Escalation:**
> **Escalating to @ViktorBarzin** — <brief reason>
> **What I found:** <details>
> **Why I can't resolve this:** <reason>
## Commit Convention
```
feat: <description> (fixes #N)
Co-Authored-By: issue-responder <noreply@anthropic.com>
```
Or for incident fixes:
```
fix: <description> (fixes #N)
Co-Authored-By: issue-responder <noreply@anthropic.com>
```


@@ -295,6 +295,40 @@ The webhook URL is passed as an environment variable from `openclaw_skill_secret
| kubectl (in pod) | `/tools/kubectl` |
| terraform (in pod) | `/tools/terraform` |
## Auto-File Incidents for SEV1/SEV2
After running health checks, if **SEV1 or SEV2 issues** are found (node down, multiple services affected, core service outage, or single important service down), auto-file a GitHub Issue:
### Severity Classification
- **SEV1**: Node NotReady, multiple services down, data at risk, core service outage (DNS, auth, ingress, databases)
- **SEV2**: Single non-core service down, degraded performance, persistent CrashLoopBackOff
- **SEV3**: Warnings only, resource pressure <90%, cosmetic → do NOT auto-file
### Workflow
1. **Dedup check**: Before filing, query open incidents:
```bash
GITHUB_TOKEN=$(vault kv get -field=github_pat secret/viktor)
curl -s -H "Authorization: token $GITHUB_TOKEN" \
"https://api.github.com/repos/ViktorBarzin/infra/issues?labels=incident&state=open&per_page=50"
```
If an open issue already covers the same service/namespace, **skip filing**.
2. **File the issue** with labels `incident`, `sev1` or `sev2`, `postmortem-required`:
- Title: `[AUTO] <Service/Namespace> — <brief symptom>`
- Body: full diagnostic dump (pod status, events, alerts, node state)
- The issue-automation GHA workflow will trigger the post-mortem pipeline automatically
3. **Auto-close recovered services**: If a service that previously had an auto-filed incident is now healthy:
```bash
# Comment and close
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" \
"https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/comments" \
-d '{"body": "**Resolved** — Service recovered. Auto-closed by cluster health check."}'
curl -s -X PATCH -H "Authorization: token $GITHUB_TOKEN" \
"https://api.github.com/repos/ViktorBarzin/infra/issues/<N>" \
-d '{"state": "closed"}'
```
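For step 2 above, filing is a single issues-API call; a hedged sketch with placeholder title/body and SEV2 labels:
```bash
GITHUB_TOKEN=$(vault kv get -field=github_pat secret/viktor)
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" \
  "https://api.github.com/repos/ViktorBarzin/infra/issues" \
  -d '{
    "title": "[AUTO] <service>/<namespace> - <brief symptom>",
    "body": "<diagnostic dump: pod status, events, alerts, node state>",
    "labels": ["incident", "sev2", "postmortem-required"]
  }'
```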
## Post-Mortem Auto-Suggest
After running a healthcheck, if the cluster has **recovered from an unhealthy state** (previous run showed FAIL items that are now resolved), suggest writing a post-mortem:


@@ -0,0 +1,12 @@
name: Feature Request
description: Request a new service, configuration change, or improvement
labels: ["feature-request"]
body:
  - type: textarea
    id: description
    attributes:
      label: What do you need?
      description: Describe what you'd like. Be as specific as possible.
      placeholder: "e.g., Deploy Obsidian for note-taking, or add a new Uptime Kuma monitor for service X"
    validations:
      required: true

.github/workflows/issue-automation.yml

@@ -0,0 +1,56 @@
name: Issue Automation
on:
  issues:
    types: [opened, labeled]
jobs:
  process-issue:
    if: |
      contains(github.event.issue.labels.*.name, 'user-report') ||
      contains(github.event.issue.labels.*.name, 'feature-request')
    runs-on: ubuntu-latest
    steps:
      - name: Check if author is collaborator
        id: check-collab
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          RESPONSE=$(curl -s -o /dev/null -w "%{http_code}" \
            -H "Authorization: token $GH_TOKEN" \
            "https://api.github.com/repos/${{ github.repository }}/collaborators/${{ github.event.issue.user.login }}")
          echo "is_collab=$([[ $RESPONSE == '204' ]] && echo 'true' || echo 'false')" >> $GITHUB_OUTPUT
          echo "Author: ${{ github.event.issue.user.login }}, Collaborator: $RESPONSE"
      - name: Queue for review (non-collaborator)
        if: steps.check-collab.outputs.is_collab == 'false'
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          curl -s -X POST \
            -H "Authorization: token $GH_TOKEN" \
            "https://api.github.com/repos/${{ github.repository }}/issues/${{ github.event.issue.number }}/comments" \
            -d '{"body": "Thanks for reporting! This has been queued for review by the infra team."}'
          curl -s -X POST \
            -H "Authorization: token $GH_TOKEN" \
            "https://api.github.com/repos/${{ github.repository }}/issues/${{ github.event.issue.number }}/labels" \
            -d '{"labels": ["needs-human"]}'
      - name: Trigger Woodpecker pipeline (collaborator)
        if: steps.check-collab.outputs.is_collab == 'true'
        run: |
          # Extract labels as comma-separated string
          LABELS=$(echo '${{ toJSON(github.event.issue.labels.*.name) }}' | python3 -c "import sys,json; print(','.join(json.load(sys.stdin)))" 2>/dev/null || echo "unknown")
          curl -sf -X POST \
            -H "Authorization: Bearer ${{ secrets.WOODPECKER_TOKEN }}" \
            "https://ci.viktorbarzin.me/api/repos/1/pipelines" \
            -d "{
              \"branch\": \"master\",
              \"variables\": {
                \"ISSUE_NUMBER\": \"${{ github.event.issue.number }}\",
                \"ISSUE_TITLE\": $(echo '${{ github.event.issue.title }}' | python3 -c 'import sys,json; print(json.dumps(sys.stdin.read().strip()))'),
                \"ISSUE_AUTHOR\": \"${{ github.event.issue.user.login }}\",
                \"ISSUE_LABELS\": \"$LABELS\",
                \"ISSUE_URL\": \"${{ github.event.issue.html_url }}\"
              }
            }"


@@ -0,0 +1,60 @@
when:
  event: manual
clone:
  git:
    image: woodpeckerci/plugin-git
    settings:
      depth: 2
steps:
  - name: run-issue-responder
    image: python:3.12-alpine
    commands:
      - apk add --no-cache openssh-client curl jq
      # Authenticate to Vault via K8s SA JWT
      - |
        SA_TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
        VAULT_RESP=$(curl -sf -X POST http://vault-active.vault.svc.cluster.local:8200/v1/auth/kubernetes/login \
          -d "{\"role\":\"ci\",\"jwt\":\"$SA_TOKEN\"}")
        VAULT_TOKEN=$(echo "$VAULT_RESP" | jq -r .auth.client_token)
        if [ -z "$VAULT_TOKEN" ] || [ "$VAULT_TOKEN" = "null" ]; then
          echo "ERROR: Vault authentication failed"
          exit 1
        fi
        echo "Vault authenticated"
      # Fetch DevVM SSH key
      - |
        curl -sf -H "X-Vault-Token: $VAULT_TOKEN" \
          http://vault-active.vault.svc.cluster.local:8200/v1/secret/data/ci/infra | \
          jq -r '.data.data.devvm_ssh_key' > /tmp/devvm-key
        chmod 600 /tmp/devvm-key
        if [ ! -s /tmp/devvm-key ]; then
          echo "ERROR: Failed to fetch DevVM SSH key"
          exit 1
        fi
        echo "SSH key fetched"
      # SSH to DevVM and run issue-responder agent
      - |
        ISSUE_NUM="${CI_PIPELINE_VARIABLE_ISSUE_NUMBER:-}"
        ISSUE_TITLE="${CI_PIPELINE_VARIABLE_ISSUE_TITLE:-}"
        ISSUE_LABELS="${CI_PIPELINE_VARIABLE_ISSUE_LABELS:-}"
        ISSUE_URL="${CI_PIPELINE_VARIABLE_ISSUE_URL:-}"
        if [ -z "$ISSUE_NUM" ]; then
          echo "ERROR: No issue number provided"
          exit 1
        fi
        echo "Processing issue #$ISSUE_NUM: $ISSUE_TITLE"
        echo "Labels: $ISSUE_LABELS"
        ssh -i /tmp/devvm-key -o StrictHostKeyChecking=no wizard@10.0.10.10 \
          "cd ~/code/infra && git pull --rebase && \
          ~/.local/bin/claude -p \
            --agent .claude/agents/issue-responder \
            --dangerously-skip-permissions \
            --max-budget-usd 10 \
            'Process GitHub Issue #${ISSUE_NUM}: ${ISSUE_TITLE}. Labels: ${ISSUE_LABELS}. URL: ${ISSUE_URL}. Read the issue body via GitHub API, investigate, and take appropriate action.'"
      # Cleanup
      - rm -f /tmp/devvm-key


@@ -0,0 +1,151 @@
# Agent Task Tracking
## Overview
All Claude Code sessions share a centralized task database powered by [Beads](https://github.com/steveyegge/beads) (`bd` CLI) backed by a Dolt SQL server running in the Kubernetes cluster. This prevents agents from duplicating work across sessions and provides persistent cross-session task tracking.
## Architecture
```
                 ┌──────────────────────────┐
                 │ Dolt SQL Server (k8s)    │
                 │ beads-server namespace   │
                 │ 10.0.20.200:3306         │
                 │ proxmox-lvm PVC (2Gi)    │
                 └────────────┬─────────────┘
                              │ MySQL protocol
          ┌───────────────────┼───────────────────┐
          │                   │                   │
┌─────────▼───────┐  ┌────────▼────────┐  ┌───────▼─────────┐
│ wizard          │  │ emo             │  │ future agents   │
│  session 1      │  │  session 1      │  │ (any machine    │
│  session 2      │  │  session 2      │  │  with network   │
│  session N      │  │                 │  │  access)        │
└─────────────────┘  └─────────────────┘  └─────────────────┘
```
### Components
| Component | Location | Purpose |
|-----------|----------|---------|
| Dolt server | `beads-server` namespace, `10.0.20.200:3306` | Centralized MySQL-compatible database |
| Root `.beads/` | `/home/wizard/code/.beads/` | Client config (server mode, prefix `code`) |
| Task context hook | `/home/wizard/.claude/hooks/beads-task-context.sh` | Injects in-progress tasks into every prompt |
| Task blocker hook | `/home/wizard/.claude/hooks/beads-block-builtin-tasks.py` | Blocks TaskCreate/TodoWrite, redirects to `bd` |
| Project settings | `/home/wizard/code/.claude/settings.json` | Shared hooks (inherited by all users) |
| Terraform stack | `stacks/beads-server/` | Deployment, Service (MetalLB LB), PVC |
### Settings Hierarchy
```
Project-level (.claude/settings.json)       ← Shared: beads hooks + TaskCreate blocker
  └─ User-level (~/.claude/settings.json)   ← Per-user: memory plugin, model, statusline
```
Both `wizard` and `emo` inherit project-level settings automatically. User-specific hooks (e.g., wizard's memory plugin) stay in the user-level settings.
## Agent Session Lifecycle
### 1. Session Start (automatic)
The `UserPromptSubmit` hook fires on every prompt:
- Queries `bd list --status in_progress` from the centralized DB
- Queries `bd list --status open | head -10` for available work
- Injects results into the agent's context as `additionalContext`
The agent sees what's currently being worked on before processing any request.
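A minimal sketch of such a hook (the real script is `beads-task-context.sh`; the exact JSON shape emitted here is an assumption about the hook contract — the point is that `bd` output is injected back as `additionalContext`):
```bash
#!/usr/bin/env bash
# Sketch only: query the shared DB and hand the result back as additional context.
DB=/home/wizard/code/.beads
{
  echo "In-progress tasks (shared Beads DB):"
  bd --db "$DB" list --status in_progress 2>/dev/null
  echo
  echo "Open tasks (top 10):"
  bd --db "$DB" list --status open 2>/dev/null | head -10
} | python3 -c 'import json, sys; print(json.dumps({"hookSpecificOutput": {"hookEventName": "UserPromptSubmit", "additionalContext": sys.stdin.read()}}))'
```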
### 2. Before Starting Work
```bash
bd list --status in_progress # What others are working on
bd ready # Unblocked tasks available
bd create "Task description" # Register your work
bd update <id> --claim # Set status to in_progress
```
### 3. During Work
```bash
bd note <id> "progress update" # Log progress
bd link <child> <parent> # Add dependencies
```
### 4. After Completing Work
```bash
bd close <id> # Mark complete
bd create "Follow-up task" # File remaining work for next session
```
### 5. Enforcement
Two layers prevent agents from using built-in task tools:
1. **CLAUDE.md instruction** (soft): "Do NOT use TaskCreate, TaskUpdate, TodoWrite"
2. **PermissionRequest hook** (hard): Blocks the tool call with a deny decision and redirect message
## Infrastructure
### Dolt Server
- **Image**: `dolthub/dolt-sql-server:latest`
- **Storage**: `proxmox-lvm` PVC, 2Gi initial, auto-resize to 10Gi
- **Service**: LoadBalancer via MetalLB on shared IP `10.0.20.200`
- `metallb.io/allow-shared-ip: shared`
- `externalTrafficPolicy: Cluster`
- **Port**: 3306 (MySQL protocol)
- **Users**: `root@%` and `beads@%` (no password, internal network)
- **Init**: `/docker-entrypoint-initdb.d/` via ConfigMap, `DOLT_ROOT_HOST=%`
- **Terraform**: `stacks/beads-server/main.tf`
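Since the server speaks the MySQL protocol with a passwordless `beads` user, connectivity can be checked from any machine with network access (assuming a `mysql` client is installed):
```bash
# No password is set for beads@% on the internal network (see Users above)
mysql -h 10.0.20.200 -P 3306 -u beads code -e "SHOW TABLES;"
```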
### Client Configuration
The root `.beads/metadata.json`:
```json
{
"backend": "dolt",
"dolt_mode": "server",
"dolt_server_host": "10.0.20.200",
"dolt_server_port": 3306,
"dolt_server_user": "beads",
"dolt_database": "code"
}
```
### Multi-User Access
- Directory permissions: `2770 wizard:code-shared` (setgid)
- Both `wizard` and `emo` are in the `code-shared` group
- `bd` binary: `/home/wizard/.local/bin/bd` (symlinked for emo at `/home/emo/.local/bin/bd`)
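A quick sanity check of this setup (hedged; expected output is based on the permissions described above):
```bash
ls -ld /home/wizard/code/.beads       # expect drwxrws--- ... wizard code-shared (2770 + setgid)
id emo | grep -o code-shared          # emo must be a member of code-shared
ls -l /home/emo/.local/bin/bd         # should point at /home/wizard/.local/bin/bd
```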
## Known Issues
### Subdirectory Shadow
Per-project `.beads/` directories exist in 7 subdirectories (finance, infra, Website, etc.). When an agent `cd`s into one of these, `bd` auto-discovers the **local** `.beads/` instead of the centralized one.
**Fix**: Always use `bd --db /home/wizard/code/.beads` when working from a subdirectory. The hook and CLAUDE.md instructions document this.
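For example, from a project that ships its own `.beads/`:
```bash
cd /home/wizard/code/infra                                     # local .beads/ would shadow the shared DB
bd --db /home/wizard/code/.beads list --status in_progress     # explicitly target the centralized DB
```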
### Hook Network Failure
The task context hook suppresses errors (`2>/dev/null`). If the Dolt server is unreachable, the hook silently exits without injecting context. Agents won't see current tasks but won't be blocked either.
### Permissions Warning
`bd` warns about `.beads` directory permissions (`0770 vs recommended 0700`). This is expected — we use `0770` for group access. The warning is harmless.
## Verification
Run the E2E test:
```bash
bash /home/wizard/code/test-beads-e2e.sh
```
This tests all 11 phases: hook injection, task CRUD, cross-user visibility, subdirectory shadowing, and multi-agent coordination. Expects 11/11 PASS.
## Related
- `CLAUDE.md` (root) — Mandatory task protocol section
- Per-project `CLAUDE.md` files — Beads integration block
- `stacks/beads-server/main.tf` — Terraform deployment