diff --git a/.claude/agents/postmortem-todo-resolver.md b/.claude/agents/postmortem-todo-resolver.md
new file mode 100644
index 00000000..b9fa80db
--- /dev/null
+++ b/.claude/agents/postmortem-todo-resolver.md
@@ -0,0 +1,89 @@
+---
+name: postmortem-todo-resolver
+description: Implements safe TODOs from post-mortem Prevention Plans. Triggered by Woodpecker pipeline on new post-mortem commits.
+model: sonnet
+allowedTools:
+  - Read
+  - Edit
+  - Write
+  - Bash
+  - Grep
+  - Glob
+  - Agent
+---
+
+You are the post-mortem TODO resolver. You implement **safe** infrastructure TODOs extracted from post-mortem documents in the ViktorBarzin/infra repository.
+
+## Safety Rules
+
+1. **ONLY implement TODOs with Type: `Alert`, `Config`, or `Monitor`**
+2. **SKIP TODOs with Type: `Architecture`, `Investigation`, `Runbook`, `Migration`** — add them to the Follow-up table as "Needs human review"
+3. **Always run `scripts/tg plan` before apply** — ABORT if plan shows any destroys > 0
+4. **Never modify platform stacks** (vault, dbaas, traefik, authentik, kyverno) without explicit approval
+5. **Max budget**: Stop after 30 minutes per TODO or $5 total
+6. **All changes MUST go through Terraform** — never kubectl apply/edit/patch as final state
+
+## Commit Convention
+
+Each TODO fix gets its own commit:
+```
+fix(post-mortem): [PM-YYYY-MM-DD]
+
+Co-Authored-By: postmortem-todo-resolver
+```
+
+## Workflow
+
+### For each safe TODO (in priority order P0 → P3):
+
+1. **Read** the relevant Terraform files mentioned in the TODO details
+2. **Implement** the change:
+   - PrometheusRule → edit `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl`
+   - Uptime Kuma monitor → use the uptime-kuma skill
+   - Config changes → edit the relevant stack's `.tf` files
+3. **Test**: `cd` to the stack directory, run `scripts/tg plan`, verify the change is safe
+4. **Apply**: `scripts/tg apply --non-interactive`
+5. **Commit**: `git add` the changed files + state, commit with the convention above
+6. **Record**: Note the commit SHA for the Follow-up table
+
+### After all TODOs processed:
+
+1. **Update the post-mortem file**:
+   - In Prevention Plan tables: change `TODO` → `Done` for implemented items
+   - Append/update the **Follow-up Implementation** section at the bottom with a table:
+
+   ```markdown
+   ## Follow-up Implementation
+
+   | Date | Action | Priority | Type | Commit | Implemented By |
+   |------|--------|----------|------|--------|----------------|
+   | YYYY-MM-DD | | P0 | Config | [`abc1234`](https://github.com/ViktorBarzin/infra/commit/abc1234) | postmortem-todo-resolver |
+   | — | | P1 | Architecture | — | Needs human review |
+   ```
+
+2. **Commit the post-mortem update**:
+   ```
+   git commit -m "docs: update post-mortem follow-up implementation [PM-YYYY-MM-DD] [ci skip]"
+   ```
+
+3. **Push all changes**: `git push origin master`
+
+## Context
+
+- **Infra repo**: `/home/wizard/code/infra`
+- **Terraform stacks**: `stacks//`
+- **Apply tool**: `scripts/tg apply --non-interactive` (handles state encryption)
+- **Prometheus alerts**: `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl`
+- **Post-mortems**: `docs/post-mortems/`
+- **GitHub repo**: `https://github.com/ViktorBarzin/infra`
+
+## Example
+
+Given a TODO: `| P2 | Add PrometheusRule for NFS mount failures | Alert | kube_pod_container_status_waiting_reason with NFS volume filter | TODO |`
+
+1. Read `prometheus_chart_values.tpl` to find the right alert group
+2. Add the new alert rule in the appropriate group
+3. `cd stacks/monitoring && scripts/tg plan` → verify 0 destroys
+4. `scripts/tg apply --non-interactive`
+5. `git add . && git commit -m "fix(post-mortem): add NFS mount failure PrometheusRule [PM-2026-04-14]"`
+6. Update post-mortem: `TODO` → `Done`, add commit to Follow-up table
diff --git a/.claude/skills/cluster-health/SKILL.md b/.claude/skills/cluster-health/SKILL.md
index 23beb1ef..7408bc1f 100644
--- a/.claude/skills/cluster-health/SKILL.md
+++ b/.claude/skills/cluster-health/SKILL.md
@@ -295,6 +295,14 @@ The webhook URL is passed as an environment variable from `openclaw_skill_secret
 | kubectl (in pod) | `/tools/kubectl` |
 | terraform (in pod) | `/tools/terraform` |
 
+## Post-Mortem Auto-Suggest
+
+After running a healthcheck, if the cluster has **recovered from an unhealthy state** (previous run showed FAIL items that are now resolved), suggest writing a post-mortem:
+
+> The cluster has recovered from the previous unhealthy state. Would you like me to write a post-mortem? Run `/post-mortem` to generate one.
+
+This ensures incidents are documented while context is fresh.
+
 ## Notes
 
 1. This script is designed to run inside the OpenClaw pod where kubectl is pre-configured via the ServiceAccount
diff --git a/.claude/skills/post-mortem/skill.md b/.claude/skills/post-mortem/skill.md
new file mode 100644
index 00000000..6457f650
--- /dev/null
+++ b/.claude/skills/post-mortem/skill.md
@@ -0,0 +1,55 @@
+# Post-Mortem Writer
+
+Generate a structured post-mortem document after an incident mitigation session.
+
+## When to use
+- After `/post-mortem` command
+- Auto-suggested when cluster health transitions from UNHEALTHY → HEALTHY
+
+## Instructions
+
+1. **Gather context**:
+   - Run `.claude/scripts/sev-context.sh` to capture current cluster state
+   - Review the conversation history for: what broke, timeline, root cause, what was fixed
+   - Check existing post-mortems at `docs/post-mortems/` for format reference
+
+2. **Generate the post-mortem**:
+   - Use the template at `.claude/skills/post-mortem/template.md`
+   - Fill in all sections from the investigation context
+   - **Critical**: In the Prevention Plan tables, set the `Type` column correctly:
+     - `Alert` — add/modify Prometheus alerting rules (auto-implementable)
+     - `Config` — change Terraform config, NFS options, etc. (auto-implementable)
+     - `Monitor` — add Uptime Kuma monitors (auto-implementable)
+     - `Architecture` — storage migration, stack redesign (human-only)
+     - `Investigation` — needs further research (human-only)
+     - `Runbook` — document a procedure (human-only)
+     - `Migration` — data or service migration (human-only)
+   - Items already fixed during the session should have Status = `Done`
+   - Items not yet done should have Status = `TODO`
+
+3. **File naming**: `docs/post-mortems/-.md`
+   - Slug: lowercase, hyphenated, max 5 words describing the incident
+
+4. **Update index**: Add an entry to `docs/post-mortems/index.html`
+   - Add a new card in the incidents grid with date, severity tag, title, description
+
+5. **Commit and push**:
+   ```
+   git add docs/post-mortems/.md docs/post-mortems/index.html
+   git commit -m "docs: post-mortem for [ci skip]"
+   git push origin master
+   ```
+   - Use `[ci skip]` to avoid triggering app-stacks pipeline
+   - NOTE: The postmortem-todos Woodpecker pipeline WILL trigger (it has its own path filter)
+
+## Type Reference for Prevention Plan
+
+| Type | Auto-implementable? | Examples |
+|------|---------------------|----------|
+| Alert | Yes | Add PrometheusRule, modify alert thresholds |
+| Config | Yes | Change Terraform variables, mount options, CronJob schedules |
+| Monitor | Yes | Add Uptime Kuma HTTP/TCP monitor |
+| Architecture | No | Migrate storage class, redesign HA topology |
+| Investigation | No | Research kernel bug, check Proxmox forum |
+| Runbook | No | Document recovery procedure |
+| Migration | No | Move data between storage backends |
diff --git a/.claude/skills/post-mortem/template.md b/.claude/skills/post-mortem/template.md
new file mode 100644
index 00000000..cda9b6da
--- /dev/null
+++ b/.claude/skills/post-mortem/template.md
@@ -0,0 +1,85 @@
+# Post-Mortem: <TITLE>
+
+| Field | Value |
+|-------|-------|
+| **Date** | <DATE> |
+| **Duration** | <DURATION> |
+| **Severity** | <SEV1/SEV2/SEV3> |
+| **Affected Services** | <COUNT> pods across <COUNT> namespaces |
+| **Status** | Draft |
+
+## Summary
+
+<1-2 sentence summary of the incident.>
+
+## Impact
+
+- **User-facing**: <What users experienced>
+- **Blast radius**: <How many services/pods/namespaces affected>
+- **Duration**: <How long the outage lasted>
+- **Data loss**: <None/details>
+- **Monitoring gap**: <Any blind spots in alerting>
+
+## Timeline (UTC)
+
+| Time | Event |
+|------|-------|
+| **HH:MM** | <First sign of trouble> |
+| **HH:MM** | <Detection / user report> |
+| **HH:MM** | <Investigation begins> |
+| **HH:MM** | <Root cause identified> |
+| **HH:MM** | <Fix applied> |
+| **HH:MM** | <Service restored> |
+
+## Root Cause
+
+<Narrative description of what went wrong and why.>
+
+## Contributing Factors
+
+1. <Factor that made the incident worse or harder to detect>
+2. <Factor...>
+
+## Detection Gaps
+
+| Gap | Impact | Fix |
+|-----|--------|-----|
+| <What wasn't monitored> | <How it delayed detection> | <What to add> |
+
+## Prevention Plan
+
+### P0 — Prevent this exact failure
+
+| Priority | Action | Type | Details | Status |
+|----------|--------|------|---------|--------|
+| P0 | <action> | Config | <details> | TODO |
+
+### P1 — Reduce blast radius
+
+| Priority | Action | Type | Details | Status |
+|----------|--------|------|---------|--------|
+| P1 | <action> | Alert | <details> | TODO |
+
+### P2 — Detect faster
+
+| Priority | Action | Type | Details | Status |
+|----------|--------|------|---------|--------|
+| P2 | <action> | Monitor | <details> | TODO |
+
+### P3 — Improve resilience
+
+| Priority | Action | Type | Details | Status |
+|----------|--------|------|---------|--------|
+| P3 | <action> | Architecture | <details> | TODO |
+
+## Lessons Learned
+
+1. <Key takeaway>
+2. <Key takeaway>
+
+## Follow-up Implementation
+
+_This section is auto-populated by the postmortem-todo-resolver agent._
+
+| Date | Action | Priority | Type | Commit | Implemented By |
+|------|--------|----------|------|--------|----------------|
diff --git a/.woodpecker/postmortem-todos.yml b/.woodpecker/postmortem-todos.yml
new file mode 100644
index 00000000..28a6880f
--- /dev/null
+++ b/.woodpecker/postmortem-todos.yml
@@ -0,0 +1,80 @@
+when:
+  event: push
+  branch: master
+  path:
+    include:
+      - 'docs/post-mortems/*.md'
+    exclude:
+      - '.woodpecker/**'
+
+steps:
+  - name: parse-todos
+    image: python:3.12-alpine
+    commands:
+      - apk add --no-cache bash jq git openssh-client
+      # Find which post-mortem changed
+      - PM_FILE=$(git diff HEAD~1 --name-only | grep 'docs/post-mortems/.*\.md' | head -1)
+      - |
+        if [ -z "$PM_FILE" ]; then
+          echo "No post-mortem markdown changes detected"
+          echo '{"skip": true}' > /tmp/todos.json
+          exit 0
+        fi
+      - echo "Post-mortem changed: $PM_FILE"
+      # Check if there are new TODOs (not just TODO→Done updates)
+      - |
+        if ! git diff HEAD~1 -- "$PM_FILE" | grep -q '+.*| TODO |'; then
+          echo "No new TODOs added (only status updates)"
+          echo '{"skip": true}' > /tmp/todos.json
+          exit 0
+        fi
+      # Parse TODOs
+      - bash scripts/parse-postmortem-todos.sh "$PM_FILE" > /tmp/todos.json
+      - cat /tmp/todos.json
+      - TODO_COUNT=$(jq '.safe_todos' /tmp/todos.json)
+      - echo "$TODO_COUNT auto-implementable TODO(s) found"
+      - |
+        if [ "$TODO_COUNT" -eq 0 ]; then
+          echo "No auto-implementable TODOs (all are Architecture/Investigation/Runbook/Migration type)"
+          echo '{"skip": true}' > /tmp/todos.json
+        fi
+
+  - name: implement-todos
+    image: alpine
+    commands:
+      - |
+        if [ "$(jq -r '.skip // empty' /tmp/todos.json 2>/dev/null)" = "true" ]; then
+          echo "Skipping — no TODOs to implement"
+          exit 0
+        fi
+      - apk add --no-cache openssh-client jq
+      - PM_FILE=$(jq -r '.file' /tmp/todos.json)
+      - PM_DATE=$(echo "$PM_FILE" | grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2}')
+      - TODOS=$(cat /tmp/todos.json)
+      # SSH to DevVM and run Claude Code in headless mode
+      - |
+        ssh -o StrictHostKeyChecking=no wizard@10.0.10.10 \
+          "cd ~/code/infra && git pull && claude -p \
+            --agent postmortem-todo-resolver \
+            --dangerously-skip-permissions \
+            --max-budget-usd 5 \
+            'Implement the auto-implementable TODOs from $PM_FILE. Here is the parsed TODO list: $TODOS'"
+    secrets:
+      - ssh_deploy_key
+
+  - name: notify-slack
+    image: alpine
+    commands:
+      - apk add --no-cache curl jq
+      - |
+        PM_FILE=$(jq -r '.file // "unknown"' /tmp/todos.json 2>/dev/null)
+        SAFE=$(jq -r '.safe_todos // 0' /tmp/todos.json 2>/dev/null)
+        SKIPPED=$(jq -r '.skipped_todos // 0' /tmp/todos.json 2>/dev/null)
+        STATUS="${CI_PIPELINE_STATUS:-unknown}"
+        curl -sf -X POST "$SLACK_WEBHOOK_URL" \
+          -H "Content-Type: application/json" \
+          -d "{\"text\": \"*Post-mortem TODO resolver* ($STATUS)\\n• File: \`$PM_FILE\`\\n• Safe TODOs processed: $SAFE\\n• Skipped (needs human): $SKIPPED\\n• Pipeline: ${CI_PIPELINE_URL:-N/A}\"}" || true
+    secrets:
+      - slack_webhook
+    when:
+      - status: [success, failure]
diff --git a/scripts/parse-postmortem-todos.sh b/scripts/parse-postmortem-todos.sh
new file mode 100755
index 00000000..b9973f40
--- /dev/null
+++ b/scripts/parse-postmortem-todos.sh
@@ -0,0 +1,89 @@
+#!/usr/bin/env bash
+# parse-postmortem-todos.sh — Extract auto-implementable TODOs from a post-mortem markdown file
+# Usage: bash scripts/parse-postmortem-todos.sh docs/post-mortems/2026-04-14-foo.md
+# Output: JSON with file path and list of TODOs
+#
+# Supports two table formats:
+#   New: | Priority | Action | Type | Details | Status |
+#   Old: | Action | Status | Details | (infers type from action text)
+set -euo pipefail
+
+PM_FILE="${1:?Usage: $0 <post-mortem.md>}"
+
+if [ ! -f "$PM_FILE" ]; then
+  echo '{"file": "", "todos": [], "error": "File not found"}' >&2
+  exit 1
+fi
+
+python3 -c "
+import re, json, sys
+
+pm_file = sys.argv[1]
+with open(pm_file) as f:
+    content = f.read()
+
+safe_types = {'Alert', 'Config', 'Monitor'}
+
+todos = []
+
+# Format 1 (new template): | Priority | Action | Type | Details | Status |
+pattern_new = r'\|\s*(P[0-3])\s*\|\s*(.+?)\s*\|\s*(\w+)\s*\|\s*(.+?)\s*\|\s*TODO\s*\|'
+for priority, action, todo_type, details in re.findall(pattern_new, content):
+    todos.append({
+        'priority': priority.strip(),
+        'action': action.strip(),
+        'type': todo_type.strip(),
+        'details': details.strip(),
+        'safe': todo_type.strip() in safe_types
+    })
+
+# Format 2 (old): | Action | TODO/Done | Details | or | Action | Owner | Status |
+# Look for rows with TODO in any column
+if not todos:
+    pattern_old = r'\|\s*(.+?)\s*\|\s*TODO\s*\|\s*(.+?)\s*\|'
+    for action, details in re.findall(pattern_old, content):
+        action = action.strip()
+        details = details.strip()
+        # Skip header rows and clean up leading pipes
+        if action.startswith('--') or action.lower() == 'action':
+            continue
+        action = action.lstrip('| ').strip()
+        # Infer type from action text
+        action_lower = action.lower()
+        if any(kw in action_lower for kw in ['prometheusrule', 'alert', 'alerting']):
+            todo_type = 'Alert'
+        elif any(kw in action_lower for kw in ['uptime kuma', 'monitor', 'ping', 'tcp check']):
+            todo_type = 'Monitor'
+        elif any(re.search(kw, action_lower) for kw in ['config', 'manage', 'add.*option', 'document', 'nfs.conf']):
+            todo_type = 'Config'
+        elif any(kw in action_lower for kw in ['migrate', 'move']):
+            todo_type = 'Migration'
+        elif any(kw in action_lower for kw in ['review', 'investigate', 'verify']):
+            todo_type = 'Investigation'
+        else:
+            todo_type = 'Config'  # default to Config for ambiguous items
+
+        # Infer priority from section header context
+        priority = 'P2'  # default
+        todos.append({
+            'priority': priority,
+            'action': action,
+            'type': todo_type,
+            'details': details,
+            'safe': todo_type in safe_types
+        })
+
+safe_todos = [t for t in todos if t['safe']]
+unsafe_todos = [t for t in todos if not t['safe']]
+
+result = {
+    'file': pm_file,
+    'todos': safe_todos,
+    'skipped': unsafe_todos,
+    'total_todos_in_doc': len(todos),
+    'safe_todos': len(safe_todos),
+    'skipped_todos': len(unsafe_todos)
+}
+
+print(json.dumps(result, indent=2))
+" "$PM_FILE"
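
Review note: the safe/unsafe split that `scripts/parse-postmortem-todos.sh` performs on new-format Prevention Plan rows can be exercised outside the pipeline. The following is a minimal standalone sketch, not the script itself. It splits each row on pipes instead of using the script's regex, which is a swapped-in technique, and the names `parse_todos` and `SAMPLE` plus the sample row details are hypothetical. It assumes no table cell contains a literal `|`.

```python
import json
import re

# Types the resolver agent is allowed to implement on its own.
SAFE_TYPES = {"Alert", "Config", "Monitor"}


def parse_todos(markdown: str) -> dict:
    """Collect open rows of | Priority | Action | Type | Details | Status | tables."""
    todos = []
    for line in markdown.splitlines():
        # Drop the outer pipes, then split the row into trimmed cells.
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) != 5:
            continue  # not a five-column Prevention Plan row
        priority, action, todo_type, details, status = cells
        # Header and separator rows fail the priority check; Done rows fail the status check.
        if not re.fullmatch(r"P[0-3]", priority) or status != "TODO":
            continue
        todos.append({
            "priority": priority,
            "action": action,
            "type": todo_type,
            "details": details,
            "safe": todo_type in SAFE_TYPES,
        })
    safe = [t for t in todos if t["safe"]]
    skipped = [t for t in todos if not t["safe"]]
    return {
        "todos": safe,
        "skipped": skipped,
        "total_todos_in_doc": len(todos),
        "safe_todos": len(safe),
        "skipped_todos": len(skipped),
    }


# Hypothetical sample rows in the template's table layout.
SAMPLE = """\
| Priority | Action | Type | Details | Status |
|----------|--------|------|---------|--------|
| P0 | Pin NFS mount options | Config | illustrative detail | Done |
| P2 | Add PrometheusRule for NFS mount failures | Alert | NFS volume filter | TODO |
| P3 | Migrate storage class | Architecture | illustrative detail | TODO |
"""

if __name__ == "__main__":
    print(json.dumps(parse_todos(SAMPLE), indent=2))
```

With the sample above, the `Done` row is ignored, the `Alert` row lands in `todos`, and the `Architecture` row lands in `skipped`, which mirrors the counts the Slack notification step reads from `safe_todos` and `skipped_todos`.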