feat: post-mortem automation pipeline
E2E workflow for incident post-mortems:
1. /post-mortem skill generates structured post-mortem markdown
2. Woodpecker pipeline triggers on docs/post-mortems/*.md changes
3. parse-postmortem-todos.sh extracts safe TODOs (Alert/Config/Monitor)
4. postmortem-todo-resolver agent implements TODOs headlessly
5. Agent updates post-mortem with Follow-up Implementation table

Components:
- .claude/skills/post-mortem/ — writer skill + template
- .claude/agents/postmortem-todo-resolver.md — headless agent
- .woodpecker/postmortem-todos.yml — CI pipeline
- scripts/parse-postmortem-todos.sh — TODO extractor
- cluster-health skill — auto-suggest post-mortem after recovery

Safety: only auto-implements Alert/Config/Monitor types.
Architecture/Migration/Investigation items are skipped.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
parent e832581caf
commit 8badb8181a
6 changed files with 406 additions and 0 deletions
.claude/agents/postmortem-todo-resolver.md (normal file, 89 lines added)
@@ -0,0 +1,89 @@
---
name: postmortem-todo-resolver
description: Implements safe TODOs from post-mortem Prevention Plans. Triggered by Woodpecker pipeline on new post-mortem commits.
model: sonnet
allowedTools:
  - Read
  - Edit
  - Write
  - Bash
  - Grep
  - Glob
  - Agent
---

You are the post-mortem TODO resolver. You implement **safe** infrastructure TODOs extracted from post-mortem documents in the ViktorBarzin/infra repository.

## Safety Rules

1. **ONLY implement TODOs with Type: `Alert`, `Config`, or `Monitor`**
2. **SKIP TODOs with Type: `Architecture`, `Investigation`, `Runbook`, `Migration`** — add them to the Follow-up table as "Needs human review"
3. **Always run `scripts/tg plan` before apply** — ABORT if plan shows any destroys > 0
4. **Never modify platform stacks** (vault, dbaas, traefik, authentik, kyverno) without explicit approval
5. **Max budget**: Stop after 30 minutes per TODO or $5 total
6. **All changes MUST go through Terraform** — never kubectl apply/edit/patch as final state

## Commit Convention

Each TODO fix gets its own commit:
```
fix(post-mortem): <action description> [PM-YYYY-MM-DD]

Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>
```

## Workflow

### For each safe TODO (in priority order P0 → P3):

1. **Read** the relevant Terraform files mentioned in the TODO details
2. **Implement** the change:
   - PrometheusRule → edit `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl`
   - Uptime Kuma monitor → use the uptime-kuma skill
   - Config changes → edit the relevant stack's `.tf` files
3. **Test**: `cd` to the stack directory, run `scripts/tg plan`, verify the change is safe
4. **Apply**: `scripts/tg apply --non-interactive`
5. **Commit**: `git add` the changed files + state, commit with the convention above
6. **Record**: Note the commit SHA for the Follow-up table

### After all TODOs processed:

1. **Update the post-mortem file**:
   - In Prevention Plan tables: change `TODO` → `Done` for implemented items
   - Append/update the **Follow-up Implementation** section at the bottom with a table:

   ```markdown
   ## Follow-up Implementation

   | Date | Action | Priority | Type | Commit | Implemented By |
   |------|--------|----------|------|--------|----------------|
   | YYYY-MM-DD | <action> | P0 | Config | [`abc1234`](https://github.com/ViktorBarzin/infra/commit/abc1234) | postmortem-todo-resolver |
   | — | <skipped action> | P1 | Architecture | — | Needs human review |
   ```

2. **Commit the post-mortem update**:
   ```
   git commit -m "docs: update post-mortem follow-up implementation [PM-YYYY-MM-DD] [ci skip]"
   ```

3. **Push all changes**: `git push origin master`

## Context

- **Infra repo**: `/home/wizard/code/infra`
- **Terraform stacks**: `stacks/<name>/`
- **Apply tool**: `scripts/tg apply --non-interactive` (handles state encryption)
- **Prometheus alerts**: `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl`
- **Post-mortems**: `docs/post-mortems/`
- **GitHub repo**: `https://github.com/ViktorBarzin/infra`

## Example

Given a TODO: `| P2 | Add PrometheusRule for NFS mount failures | Alert | kube_pod_container_status_waiting_reason with NFS volume filter | TODO |`

1. Read `prometheus_chart_values.tpl` to find the right alert group
2. Add the new alert rule in the appropriate group
3. `cd stacks/monitoring && scripts/tg plan` → verify 0 destroys
4. `scripts/tg apply --non-interactive`
5. `git add . && git commit -m "fix(post-mortem): add NFS mount failure PrometheusRule [PM-2026-04-14]"`
6. Update post-mortem: `TODO` → `Done`, add commit to Follow-up table
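Safety rule 3 ("ABORT if plan shows any destroys > 0") can be checked mechanically instead of by reading the plan. The following is a minimal sketch, not part of this commit. It assumes `scripts/tg plan` passes through Terraform's standard uncoloured summary line (`Plan: X to add, Y to change, Z to destroy.`); the helper names `plan_destroy_count` and `plan_is_safe` are hypothetical.

```python
import re

# Terraform ends a plan with "Plan: X to add, Y to change, Z to destroy."
# or prints "No changes." when nothing would be modified.
PLAN_SUMMARY = re.compile(r"Plan: (\d+) to add, (\d+) to change, (\d+) to destroy\.")


def plan_destroy_count(plan_output: str) -> int:
    """Return how many resources the plan would destroy.

    Raises ValueError when no summary is found, so an unreadable plan
    (for example one with ANSI colour codes) fails closed.
    """
    match = PLAN_SUMMARY.search(plan_output)
    if match:
        return int(match.group(3))
    if "No changes." in plan_output:
        return 0
    raise ValueError("no plan summary found; refusing to treat the plan as safe")


def plan_is_safe(plan_output: str) -> bool:
    """True only when the plan destroys nothing."""
    return plan_destroy_count(plan_output) == 0
```

Raising on an unrecognised plan is a deliberate choice here: the agent runs unattended, so a parse failure should stop the apply rather than let it through.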
@@ -295,6 +295,14 @@ The webhook URL is passed as an environment variable from `openclaw_skill_secret
| kubectl (in pod) | `/tools/kubectl` |
| terraform (in pod) | `/tools/terraform` |

## Post-Mortem Auto-Suggest

After running a healthcheck, if the cluster has **recovered from an unhealthy state** (previous run showed FAIL items that are now resolved), suggest writing a post-mortem:

> The cluster has recovered from the previous unhealthy state. Would you like me to write a post-mortem? Run `/post-mortem` to generate one.

This ensures incidents are documented while context is fresh.

## Notes

1. This script is designed to run inside the OpenClaw pod where kubectl is pre-configured via the ServiceAccount
.claude/skills/post-mortem/skill.md (normal file, 55 lines added)
@@ -0,0 +1,55 @@
# Post-Mortem Writer

Generate a structured post-mortem document after an incident mitigation session.

## When to use
- After `/post-mortem` command
- Auto-suggested when cluster health transitions from UNHEALTHY → HEALTHY

## Instructions

1. **Gather context**:
   - Run `.claude/scripts/sev-context.sh` to capture current cluster state
   - Review the conversation history for: what broke, timeline, root cause, what was fixed
   - Check existing post-mortems at `docs/post-mortems/` for format reference

2. **Generate the post-mortem**:
   - Use the template at `.claude/skills/post-mortem/template.md`
   - Fill in all sections from the investigation context
   - **Critical**: In the Prevention Plan tables, set the `Type` column correctly:
     - `Alert` — add/modify Prometheus alerting rules (auto-implementable)
     - `Config` — change Terraform config, NFS options, etc. (auto-implementable)
     - `Monitor` — add Uptime Kuma monitors (auto-implementable)
     - `Architecture` — storage migration, stack redesign (human-only)
     - `Investigation` — needs further research (human-only)
     - `Runbook` — document a procedure (human-only)
     - `Migration` — data or service migration (human-only)
   - Items already fixed during the session should have Status = `Done`
   - Items not yet done should have Status = `TODO`

3. **File naming**: `docs/post-mortems/<YYYY-MM-DD>-<slug>.md`
   - Slug: lowercase, hyphenated, max 5 words describing the incident

4. **Update index**: Add an entry to `docs/post-mortems/index.html`
   - Add a new card in the incidents grid with date, severity tag, title, description

5. **Commit and push**:
   ```
   git add docs/post-mortems/<file>.md docs/post-mortems/index.html
   git commit -m "docs: post-mortem for <date> <title> [ci skip]"
   git push origin master
   ```
   - Use `[ci skip]` to avoid triggering app-stacks pipeline
   - NOTE: The postmortem-todos Woodpecker pipeline WILL trigger (it has its own path filter)

## Type Reference for Prevention Plan

| Type | Auto-implementable? | Examples |
|------|---------------------|----------|
| Alert | Yes | Add PrometheusRule, modify alert thresholds |
| Config | Yes | Change Terraform variables, mount options, CronJob schedules |
| Monitor | Yes | Add Uptime Kuma HTTP/TCP monitor |
| Architecture | No | Migrate storage class, redesign HA topology |
| Investigation | No | Research kernel bug, check Proxmox forum |
| Runbook | No | Document recovery procedure |
| Migration | No | Move data between storage backends |
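The file-naming rule in step 3 (date prefix, lowercase hyphenated slug, at most five words) is easy to get subtly wrong by hand. A small sketch of one way to derive the path from an incident title follows; the function name `postmortem_path` and its `max_words` parameter are hypothetical and not part of the skill.

```python
import re
from datetime import date


def postmortem_path(incident_date: date, title: str, max_words: int = 5) -> str:
    """Build docs/post-mortems/<YYYY-MM-DD>-<slug>.md from a free-text title."""
    # Lowercase, turn every run of non-alphanumerics into a space, then split.
    words = re.sub(r"[^a-z0-9]+", " ", title.lower()).split()
    if not words:
        raise ValueError("title must contain at least one alphanumeric word")
    slug = "-".join(words[:max_words])
    return f"docs/post-mortems/{incident_date.isoformat()}-{slug}.md"
```

For example, `postmortem_path(date(2026, 4, 14), "NFS mount failures on worker nodes")` yields `docs/post-mortems/2026-04-14-nfs-mount-failures-on-worker.md`, which also matches the `docs/post-mortems/*.md` path filter used by the pipeline.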
.claude/skills/post-mortem/template.md (normal file, 85 lines added)
@@ -0,0 +1,85 @@
# Post-Mortem: <TITLE>

| Field | Value |
|-------|-------|
| **Date** | <DATE> |
| **Duration** | <DURATION> |
| **Severity** | <SEV1/SEV2/SEV3> |
| **Affected Services** | <COUNT> pods across <COUNT> namespaces |
| **Status** | Draft |

## Summary

<1-2 sentence summary of the incident.>

## Impact

- **User-facing**: <What users experienced>
- **Blast radius**: <How many services/pods/namespaces affected>
- **Duration**: <How long the outage lasted>
- **Data loss**: <None/details>
- **Monitoring gap**: <Any blind spots in alerting>

## Timeline (UTC)

| Time | Event |
|------|-------|
| **HH:MM** | <First sign of trouble> |
| **HH:MM** | <Detection / user report> |
| **HH:MM** | <Investigation begins> |
| **HH:MM** | <Root cause identified> |
| **HH:MM** | <Fix applied> |
| **HH:MM** | <Service restored> |

## Root Cause

<Narrative description of what went wrong and why.>

## Contributing Factors

1. <Factor that made the incident worse or harder to detect>
2. <Factor...>

## Detection Gaps

| Gap | Impact | Fix |
|-----|--------|-----|
| <What wasn't monitored> | <How it delayed detection> | <What to add> |

## Prevention Plan

### P0 — Prevent this exact failure

| Priority | Action | Type | Details | Status |
|----------|--------|------|---------|--------|
| P0 | <action> | Config | <details> | TODO |

### P1 — Reduce blast radius

| Priority | Action | Type | Details | Status |
|----------|--------|------|---------|--------|
| P1 | <action> | Alert | <details> | TODO |

### P2 — Detect faster

| Priority | Action | Type | Details | Status |
|----------|--------|------|---------|--------|
| P2 | <action> | Monitor | <details> | TODO |

### P3 — Improve resilience

| Priority | Action | Type | Details | Status |
|----------|--------|------|---------|--------|
| P3 | <action> | Architecture | <details> | TODO |

## Lessons Learned

1. <Key takeaway>
2. <Key takeaway>

## Follow-up Implementation

_This section is auto-populated by the postmortem-todo-resolver agent._

| Date | Action | Priority | Type | Commit | Implemented By |
|------|--------|----------|------|--------|----------------|
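The Follow-up Implementation table has two row shapes: an implemented item with a commit link, and a skipped item marked "Needs human review". As an illustration only, the sketch below renders both shapes in the column order of the template. The helper `followup_row` is hypothetical; the repository URL and the placeholder dash are taken from the agent definition in this commit.

```python
from typing import Optional

REPO_URL = "https://github.com/ViktorBarzin/infra"
DASH = "\u2014"  # placeholder used in the Date and Commit cells of skipped rows


def followup_row(action: str, priority: str, todo_type: str,
                 day: Optional[str] = None, sha: Optional[str] = None) -> str:
    """Render one Follow-up Implementation row; sha=None means 'skipped'."""
    if sha is None:
        cells = [DASH, action, priority, todo_type, DASH, "Needs human review"]
    else:
        if day is None:
            raise ValueError("implemented rows need a date")
        short = sha[:7]
        link = f"[`{short}`]({REPO_URL}/commit/{short})"
        cells = [day, action, priority, todo_type, link, "postmortem-todo-resolver"]
    # Escape literal pipes so free-text cells cannot break the table layout.
    cells = [cell.replace("|", "\\|") for cell in cells]
    return "| " + " | ".join(cells) + " |"
```

Escaping pipes matters because the same tables are later re-read by `scripts/parse-postmortem-todos.sh`, which treats `|` as the cell separator.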
.woodpecker/postmortem-todos.yml (normal file, 80 lines added)
@@ -0,0 +1,80 @@
when:
  event: push
  branch: master
  path:
    include:
      - 'docs/post-mortems/*.md'
    exclude:
      - '.woodpecker/**'

steps:
  - name: parse-todos
    image: python:3.12-alpine
    commands:
      # bash is required by scripts/parse-postmortem-todos.sh and is not in the Alpine base image
      - apk add --no-cache bash jq git openssh-client
      # Find which post-mortem changed
      - PM_FILE=$(git diff HEAD~1 --name-only | grep 'docs/post-mortems/.*\.md' | head -1)
      # todos.json is written to the workspace: it is the only directory shared between steps
      - |
        if [ -z "$PM_FILE" ]; then
          echo "No post-mortem markdown changes detected"
          echo '{"skip": true}' > todos.json
          exit 0
        fi
      - echo "Post-mortem changed: $PM_FILE"
      # Check if there are new TODOs (not just TODO→Done updates)
      - |
        if ! git diff HEAD~1 -- "$PM_FILE" | grep -q '+.*| TODO |'; then
          echo "No new TODOs added (only status updates)"
          echo '{"skip": true}' > todos.json
          exit 0
        fi
      # Parse TODOs (the script is bash with an embedded Python block, so run it with bash)
      - bash scripts/parse-postmortem-todos.sh "$PM_FILE" > todos.json
      - cat todos.json
      - TODO_COUNT=$(jq '.safe_todos' todos.json)
      - echo "$TODO_COUNT auto-implementable TODO(s) found"
      - |
        if [ "$TODO_COUNT" -eq 0 ]; then
          echo "No auto-implementable TODOs (all are Architecture/Investigation/Runbook/Migration type)"
          echo '{"skip": true}' > todos.json
        fi

  - name: implement-todos
    image: alpine
    commands:
      # Install jq before the skip check below uses it
      - apk add --no-cache openssh-client jq
      - |
        if [ "$(jq -r '.skip // empty' todos.json 2>/dev/null)" = "true" ]; then
          echo "Skipping — no TODOs to implement"
          exit 0
        fi
      - PM_FILE=$(jq -r '.file' todos.json)
      # BusyBox grep has no -P, so use an ERE instead of \d
      - PM_DATE=$(echo "$PM_FILE" | grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2}')
      - TODOS=$(cat todos.json)
      # SSH to DevVM and run Claude Code in headless mode
      - |
        ssh -o StrictHostKeyChecking=no wizard@10.0.10.10 \
          "cd ~/code/infra && git pull && claude -p \
          --agent postmortem-todo-resolver \
          --dangerously-skip-permissions \
          --max-budget-usd 5 \
          'Implement the auto-implementable TODOs from $PM_FILE. Here is the parsed TODO list: $TODOS'"
    secrets:
      - ssh_deploy_key

  - name: notify-slack
    image: alpine
    commands:
      - apk add --no-cache curl jq
      # The slack_webhook secret is exposed to the step as $SLACK_WEBHOOK
      - |
        PM_FILE=$(jq -r '.file // "unknown"' todos.json 2>/dev/null)
        SAFE=$(jq -r '.safe_todos // 0' todos.json 2>/dev/null)
        SKIPPED=$(jq -r '.skipped_todos // 0' todos.json 2>/dev/null)
        STATUS="${CI_PIPELINE_STATUS:-unknown}"
        curl -sf -X POST "$SLACK_WEBHOOK" \
          -H "Content-Type: application/json" \
          -d "{\"text\": \"*Post-mortem TODO resolver* ($STATUS)\\n• File: \`$PM_FILE\`\\n• Safe TODOs processed: $SAFE\\n• Skipped (needs human): $SKIPPED\\n• Pipeline: ${CI_PIPELINE_URL:-N/A}\"}" || true
    secrets:
      - slack_webhook
    when:
      - status: [success, failure]
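The notify step assembles its JSON body inside a shell string, which produces invalid JSON if the file name ever contains a quote or backslash. One alternative, sketched here under the assumption that Python is available in the step image, is to let `json.dumps` do the escaping. The function name `slack_payload` is hypothetical and is not used by the pipeline in this commit.

```python
import json


def slack_payload(status: str, pm_file: str, safe: int, skipped: int,
                  pipeline_url: str = "N/A") -> str:
    """Return the Slack webhook body as a JSON string with correct escaping."""
    text = (
        f"*Post-mortem TODO resolver* ({status})\n"
        f"\u2022 File: `{pm_file}`\n"
        f"\u2022 Safe TODOs processed: {safe}\n"
        f"\u2022 Skipped (needs human): {skipped}\n"
        f"\u2022 Pipeline: {pipeline_url}"
    )
    return json.dumps({"text": text})
```

The returned string could then be passed to `curl -d` unchanged, since every special character in the message is already escaped.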
scripts/parse-postmortem-todos.sh (executable file, 89 lines added)
@@ -0,0 +1,89 @@
#!/usr/bin/env bash
# parse-postmortem-todos.sh — Extract auto-implementable TODOs from a post-mortem markdown file
# Usage: bash scripts/parse-postmortem-todos.sh docs/post-mortems/2026-04-14-foo.md
# Output: JSON with file path and list of TODOs
#
# Supports two table formats:
# New: | Priority | Action | Type | Details | Status |
# Old: | Action | Status | Details | (infers type from action text)
set -euo pipefail

PM_FILE="${1:?Usage: $0 <post-mortem.md>}"

if [ ! -f "$PM_FILE" ]; then
  echo '{"file": "", "todos": [], "error": "File not found"}' >&2
  exit 1
fi

python3 -c "
import re, json, sys

pm_file = sys.argv[1]
with open(pm_file) as f:
    content = f.read()

safe_types = {'Alert', 'Config', 'Monitor'}

todos = []

# Format 1 (new template): | Priority | Action | Type | Details | Status |
pattern_new = r'\|\s*(P[0-3])\s*\|\s*(.+?)\s*\|\s*(\w+)\s*\|\s*(.+?)\s*\|\s*TODO\s*\|'
for priority, action, todo_type, details in re.findall(pattern_new, content):
    todos.append({
        'priority': priority.strip(),
        'action': action.strip(),
        'type': todo_type.strip(),
        'details': details.strip(),
        'safe': todo_type.strip() in safe_types
    })

# Format 2 (old): | Action | TODO/Done | Details | or | Action | Owner | Status |
# Used only when no new-format rows were found
if not todos:
    pattern_old = r'\|\s*(.+?)\s*\|\s*TODO\s*\|\s*(.+?)\s*\|'
    for action, details in re.findall(pattern_old, content):
        action = action.strip()
        details = details.strip()
        # Skip header rows and clean up leading pipes
        if action.startswith('--') or action.lower() == 'action':
            continue
        action = action.lstrip('| ').strip()
        # Infer type from action text
        action_lower = action.lower()
        if any(kw in action_lower for kw in ['prometheusrule', 'alert', 'alerting']):
            todo_type = 'Alert'
        elif any(kw in action_lower for kw in ['uptime kuma', 'monitor', 'ping', 'tcp check']):
            todo_type = 'Monitor'
        # These keywords are regexes ('add.*option'), so match with re.search rather than 'in'
        elif any(re.search(kw, action_lower) for kw in ['config', 'manage', r'add.*option', 'document', r'nfs\.conf']):
            todo_type = 'Config'
        elif any(kw in action_lower for kw in ['migrate', 'move']):
            todo_type = 'Migration'
        elif any(kw in action_lower for kw in ['review', 'investigate', 'verify']):
            todo_type = 'Investigation'
        else:
            todo_type = 'Config'  # default to Config for ambiguous items

        # Old-format rows carry no priority column, so fall back to P2
        priority = 'P2'  # default
        todos.append({
            'priority': priority,
            'action': action,
            'type': todo_type,
            'details': details,
            'safe': todo_type in safe_types
        })

safe_todos = [t for t in todos if t['safe']]
unsafe_todos = [t for t in todos if not t['safe']]

result = {
    'file': pm_file,
    'todos': safe_todos,
    'skipped': unsafe_todos,
    'total_todos_in_doc': len(todos),
    'safe_todos': len(safe_todos),
    'skipped_todos': len(unsafe_todos)
}

print(json.dumps(result, indent=2))
" "$PM_FILE"
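The new-format pattern in the script relies on lazy `.+?` groups to stop at the right pipe. A stricter alternative, shown here only as a sketch and not as the script's method, is to split each table row into cells and require exactly five of them. The function name `parse_todo_rows` is hypothetical, and the sketch assumes cells never contain a literal `|`.

```python
SAFE_TYPES = {"Alert", "Config", "Monitor"}
PRIORITIES = {"P0", "P1", "P2", "P3"}


def parse_todo_rows(markdown: str) -> list[dict]:
    """Extract open TODO rows shaped | Priority | Action | Type | Details | Status |."""
    todos = []
    for line in markdown.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # not a table row
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        if len(cells) != 5:
            continue  # some other table
        priority, action, todo_type, details, status = cells
        # Header and separator rows fail the priority check and are skipped.
        if priority not in PRIORITIES or status != "TODO":
            continue
        todos.append({
            "priority": priority,
            "action": action,
            "type": todo_type,
            "details": details,
            "safe": todo_type in SAFE_TYPES,
        })
    return todos
```

Run against the example row from the agent definition (`| P2 | Add PrometheusRule for NFS mount failures | Alert | ... | TODO |`), this returns one entry with `safe` set to `True`; an `Architecture` row with status `TODO` is returned with `safe` set to `False`, and `Done` rows are ignored.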