feat: post-mortem automation pipeline

E2E workflow for incident post-mortems:

1. /post-mortem skill generates structured post-mortem markdown
2. Woodpecker pipeline triggers on docs/post-mortems/*.md changes
3. parse-postmortem-todos.sh extracts safe TODOs (Alert/Config/Monitor)
4. postmortem-todo-resolver agent implements TODOs headlessly
5. Agent updates post-mortem with Follow-up Implementation table

Components:

- .claude/skills/post-mortem/: writer skill + template
- .claude/agents/postmortem-todo-resolver.md: headless agent
- .woodpecker/postmortem-todos.yml: CI pipeline
- scripts/parse-postmortem-todos.sh: TODO extractor
- cluster-health skill: auto-suggest post-mortem after recovery

Safety: only auto-implements Alert/Config/Monitor types.
Architecture/Migration/Investigation items are skipped.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 15:34:42 +00:00 · 8badb8181a
commit 8badb8181a
parent e832581caf
6 changed files with 406 additions and 0 deletions
--- a/.claude/skills/post-mortem/template.md
+++ b/.claude/skills/post-mortem/template.md
@@ -0,0 +1,85 @@
+# Post-Mortem: <TITLE>
+
+| Field | Value |
+|-------|-------|
+| **Date** | <DATE> |
+| **Duration** | <DURATION> |
+| **Severity** | <SEV1/SEV2/SEV3> |
+| **Affected Services** | <COUNT> pods across <COUNT> namespaces |
+| **Status** | Draft |
+
+## Summary
+
+<1-2 sentence summary of the incident.>
+
+## Impact
+
+- **User-facing**: <What users experienced>
+- **Blast radius**: <How many services/pods/namespaces affected>
+- **Duration**: <How long the outage lasted>
+- **Data loss**: <None/details>
+- **Monitoring gap**: <Any blind spots in alerting>
+
+## Timeline (UTC)
+
+| Time | Event |
+|------|-------|
+| **HH:MM** | <First sign of trouble> |
+| **HH:MM** | <Detection / user report> |
+| **HH:MM** | <Investigation begins> |
+| **HH:MM** | <Root cause identified> |
+| **HH:MM** | <Fix applied> |
+| **HH:MM** | <Service restored> |
+
+## Root Cause
+
+<Narrative description of what went wrong and why.>
+
+## Contributing Factors
+
+1. <Factor that made the incident worse or harder to detect>
+2. <Factor...>
+
+## Detection Gaps
+
+| Gap | Impact | Fix |
+|-----|--------|-----|
+| <What wasn't monitored> | <How it delayed detection> | <What to add> |
+
+## Prevention Plan
+
+### P0 — Prevent this exact failure
+
+| Priority | Action | Type | Details | Status |
+|----------|--------|------|---------|--------|
+| P0 | <action> | Config | <details> | TODO |
+
+### P1 — Reduce blast radius
+
+| Priority | Action | Type | Details | Status |
+|----------|--------|------|---------|--------|
+| P1 | <action> | Alert | <details> | TODO |
+
+### P2 — Detect faster
+
+| Priority | Action | Type | Details | Status |
+|----------|--------|------|---------|--------|
+| P2 | <action> | Monitor | <details> | TODO |
+
+### P3 — Improve resilience
+
+| Priority | Action | Type | Details | Status |
+|----------|--------|------|---------|--------|
+| P3 | <action> | Architecture | <details> | TODO |
+
+## Lessons Learned
+
+1. <Key takeaway>
+2. <Key takeaway>
+
+## Follow-up Implementation
+
+_This section is auto-populated by the postmortem-todo-resolver agent._
+
+| Date | Action | Priority | Type | Commit | Implemented By |
+|------|--------|----------|------|--------|----------------|
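
Illustration of the extraction step described in the commit message: the template's Prevention Plan tables carry a `Type` and a `Status` column, and only `TODO` rows of type Alert, Config, or Monitor are meant to be auto-implemented. The sketch below is not the contents of `scripts/parse-postmortem-todos.sh` (which this diff does not show); it is a minimal Python approximation of that filter. The function name `extract_safe_todos`, the returned dictionary keys, and the sample rows are hypothetical, and it assumes table cells contain no literal `|` characters.

```python
import re

# Types the commit message lists as safe to auto-implement.
SAFE_TYPES = {"Alert", "Config", "Monitor"}


def extract_safe_todos(markdown: str) -> list[dict]:
    """Return Prevention Plan rows whose Status is TODO and whose Type is safe."""
    todos = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        # Split the row into cells, dropping the outer pipes.
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Prevention Plan rows have exactly five columns:
        # Priority | Action | Type | Details | Status
        # Header and separator rows fail the P0..P3 check and are skipped.
        if len(cells) != 5 or not re.fullmatch(r"P[0-3]", cells[0]):
            continue
        priority, action, todo_type, details, status = cells
        if status == "TODO" and todo_type in SAFE_TYPES:
            todos.append(
                {
                    "priority": priority,
                    "action": action,
                    "type": todo_type,
                    "details": details,
                }
            )
    return todos


# Hypothetical sample rows in the template's table layout.
sample = """\
| Priority | Action | Type | Details | Status |
|----------|--------|------|---------|--------|
| P0 | Raise memory limit | Config | values.yaml | TODO |
| P1 | Add OOM alert | Alert | rule file | DONE |
| P3 | Split the cluster | Architecture | needs design | TODO |
"""

for todo in extract_safe_todos(sample):
    print(todo["priority"], todo["type"], todo["action"])
```

With the sample above, only the P0 Config row is printed: the P1 row is already `DONE`, and the P3 row has a type outside the safe set. The six-column Follow-up Implementation table is ignored because it does not match the five-column shape.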