feat: post-mortem automation pipeline

E2E workflow for incident post-mortems:
1. /post-mortem skill generates structured post-mortem markdown
2. Woodpecker pipeline triggers on docs/post-mortems/*.md changes
3. parse-postmortem-todos.sh extracts safe TODOs (Alert/Config/Monitor)
4. postmortem-todo-resolver agent implements TODOs headlessly
5. Agent updates post-mortem with Follow-up Implementation table

Components:
- .claude/skills/post-mortem/ — writer skill + template
- .claude/agents/postmortem-todo-resolver.md — headless agent
- .woodpecker/postmortem-todos.yml — CI pipeline
- scripts/parse-postmortem-todos.sh — TODO extractor
- cluster-health skill — auto-suggest post-mortem after recovery

Safety: only Alert/Config/Monitor items are auto-implemented.
All other types (Architecture/Investigation/Runbook/Migration) are skipped.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Viktor Barzin 2026-04-14 15:34:42 +00:00
parent e832581caf
commit 8badb8181a
6 changed files with 406 additions and 0 deletions


@@ -295,6 +295,14 @@ The webhook URL is passed as an environment variable from `openclaw_skill_secret
| kubectl (in pod) | `/tools/kubectl` |
| terraform (in pod) | `/tools/terraform` |
## Post-Mortem Auto-Suggest
After running a healthcheck, if the cluster has **recovered from an unhealthy state** (the previous run showed FAIL items that are now resolved), suggest writing a post-mortem:
> The cluster has recovered from the previous unhealthy state. Would you like me to write a post-mortem? Run `/post-mortem` to generate one.
This ensures incidents are documented while context is fresh.
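The recovery check described above could be sketched as follows. This is a minimal illustration, assuming each healthcheck run can be reduced to a count of FAIL items; the function name `should_suggest_postmortem` and the count-based comparison are hypothetical and not taken from the skill itself.

```shell
# Hypothetical helper: decide whether to suggest a post-mortem.
# $1 = number of FAIL items in the previous healthcheck run
# $2 = number of FAIL items in the current healthcheck run
should_suggest_postmortem() {
  # Recovery means the previous run was unhealthy (at least one FAIL)
  # and the current run has no FAIL items left.
  if [ "$1" -gt 0 ] && [ "$2" -eq 0 ]; then
    echo "suggest"
  else
    echo "skip"
  fi
}
```

For example, `should_suggest_postmortem 3 0` prints `suggest`, while `should_suggest_postmortem 0 0` and `should_suggest_postmortem 3 1` both print `skip`.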
## Notes
1. This script is designed to run inside the OpenClaw pod where kubectl is pre-configured via the ServiceAccount


@@ -0,0 +1,55 @@
# Post-Mortem Writer
Generate a structured post-mortem document after an incident mitigation session.
## When to use
- When the user runs the `/post-mortem` command
- Auto-suggested when cluster health transitions from UNHEALTHY → HEALTHY
## Instructions
1. **Gather context**:
   - Run `.claude/scripts/sev-context.sh` to capture current cluster state
   - Review the conversation history for: what broke, timeline, root cause, what was fixed
   - Check existing post-mortems at `docs/post-mortems/` for format reference
2. **Generate the post-mortem**:
   - Use the template at `.claude/skills/post-mortem/template.md`
   - Fill in all sections from the investigation context
   - **Critical**: In the Prevention Plan tables, set the `Type` column correctly:
     - `Alert`: add/modify Prometheus alerting rules (auto-implementable)
     - `Config`: change Terraform config, NFS options, etc. (auto-implementable)
     - `Monitor`: add Uptime Kuma monitors (auto-implementable)
     - `Architecture`: storage migration, stack redesign (human-only)
     - `Investigation`: needs further research (human-only)
     - `Runbook`: document a procedure (human-only)
     - `Migration`: data or service migration (human-only)
   - Items already fixed during the session should have Status = `Done`
   - Items not yet done should have Status = `TODO`
3. **File naming**: `docs/post-mortems/<YYYY-MM-DD>-<slug>.md`
   - Slug: lowercase, hyphenated, max 5 words describing the incident
4. **Update index**: Add an entry to `docs/post-mortems/index.html`
   - Add a new card in the incidents grid with date, severity tag, title, description
5. **Commit and push**:
   ```
   git add docs/post-mortems/<file>.md docs/post-mortems/index.html
   git commit -m "docs: post-mortem for <date> <title> [ci skip]"
   git push origin master
   ```
   - Use `[ci skip]` to avoid triggering the app-stacks pipeline
   - NOTE: The postmortem-todos Woodpecker pipeline WILL still trigger (it has its own path filter)
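The file naming rule in step 3 (lowercase, hyphenated, at most 5 words) could be sketched like this. The helper name `postmortem_path` is hypothetical, and treating every non-alphanumeric character as a word separator is an assumption, since the skill does not say how punctuation should be handled.

```shell
# Hypothetical helper: build a post-mortem path from a date and an incident title.
# $1 = date as YYYY-MM-DD, $2 = free-form incident title
postmortem_path() {
  local slug
  slug=$(printf '%s' "$2" \
    | tr '[:upper:]' '[:lower:]' \
    | tr -cs 'a-z0-9' ' ' \
    | awk '{
        # Keep at most the first 5 words and join them with hyphens.
        n = (NF < 5) ? NF : 5
        out = $1
        for (i = 2; i <= n; i++) out = out "-" $i
        print out
      }')
  printf 'docs/post-mortems/%s-%s.md\n' "$1" "$slug"
}
```

With a made-up title, `postmortem_path 2026-04-14 "NFS Mount Hang on Worker Nodes After Reboot"` prints `docs/post-mortems/2026-04-14-nfs-mount-hang-on-worker.md`.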
## Type Reference for Prevention Plan
| Type | Auto-implementable? | Examples |
|------|---------------------|----------|
| Alert | Yes | Add PrometheusRule, modify alert thresholds |
| Config | Yes | Change Terraform variables, mount options, CronJob schedules |
| Monitor | Yes | Add Uptime Kuma HTTP/TCP monitor |
| Architecture | No | Migrate storage class, redesign HA topology |
| Investigation | No | Research kernel bug, check Proxmox forum |
| Runbook | No | Document recovery procedure |
| Migration | No | Move data between storage backends |
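To illustrate how the `Type` and `Status` columns drive automation, here is a rough sketch of a filter over Prevention Plan rows. It is not the contents of `scripts/parse-postmortem-todos.sh`, which is not shown here; the function name `extract_safe_todos` and the `priority|action|type` output format are assumptions made for the example.

```shell
# Hypothetical sketch: read a post-mortem on stdin and print
# "priority|action|type" for every auto-implementable TODO row.
extract_safe_todos() {
  awk -F'|' '
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    # Prevention Plan rows have 5 cells, which gives 7 fields when split on "|".
    NF == 7 {
      prio = trim($2); act = trim($3); kind = trim($4); state = trim($6)
      # Header and separator rows fail the priority check and are ignored.
      if (prio ~ /^P[0-3]$/ && state == "TODO" &&
          (kind == "Alert" || kind == "Config" || kind == "Monitor"))
        print prio "|" act "|" kind
    }
  '
}
```

Run against the unfilled template (`extract_safe_todos < .claude/skills/post-mortem/template.md`), this sketch would print the P0 `Config`, P1 `Alert`, and P2 `Monitor` placeholder rows and skip the P3 `Architecture` row. Rows marked `Done` are skipped regardless of type.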


@@ -0,0 +1,85 @@
# Post-Mortem: <TITLE>
| Field | Value |
|-------|-------|
| **Date** | <DATE> |
| **Duration** | <DURATION> |
| **Severity** | <SEV1/SEV2/SEV3> |
| **Affected Services** | <COUNT> pods across <COUNT> namespaces |
| **Status** | Draft |
## Summary
<1-2 sentence summary of the incident.>
## Impact
- **User-facing**: <What users experienced>
- **Blast radius**: <How many services/pods/namespaces affected>
- **Duration**: <How long the outage lasted>
- **Data loss**: <None/details>
- **Monitoring gap**: <Any blind spots in alerting>
## Timeline (UTC)
| Time | Event |
|------|-------|
| **HH:MM** | <First sign of trouble> |
| **HH:MM** | <Detection / user report> |
| **HH:MM** | <Investigation begins> |
| **HH:MM** | <Root cause identified> |
| **HH:MM** | <Fix applied> |
| **HH:MM** | <Service restored> |
## Root Cause
<Narrative description of what went wrong and why.>
## Contributing Factors
1. <Factor that made the incident worse or harder to detect>
2. <Factor...>
## Detection Gaps
| Gap | Impact | Fix |
|-----|--------|-----|
| <What wasn't monitored> | <How it delayed detection> | <What to add> |
## Prevention Plan
### P0 — Prevent this exact failure
| Priority | Action | Type | Details | Status |
|----------|--------|------|---------|--------|
| P0 | <action> | Config | <details> | TODO |
### P1 — Reduce blast radius
| Priority | Action | Type | Details | Status |
|----------|--------|------|---------|--------|
| P1 | <action> | Alert | <details> | TODO |
### P2 — Detect faster
| Priority | Action | Type | Details | Status |
|----------|--------|------|---------|--------|
| P2 | <action> | Monitor | <details> | TODO |
### P3 — Improve resilience
| Priority | Action | Type | Details | Status |
|----------|--------|------|---------|--------|
| P3 | <action> | Architecture | <details> | TODO |
## Lessons Learned
1. <Key takeaway>
2. <Key takeaway>
## Follow-up Implementation
_This section is auto-populated by the postmortem-todo-resolver agent._
| Date | Action | Priority | Type | Commit | Implemented By |
|------|--------|----------|------|--------|----------------|
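As a rough illustration of how a row for the Follow-up Implementation table might be produced, the sketch below formats one line matching the six columns above. The helper name `followup_row` is hypothetical, and how the postmortem-todo-resolver agent actually writes the table is not specified here.

```shell
# Hypothetical helper: format one Follow-up Implementation table row.
# Arguments, in order: date, action, priority, type, commit, implemented-by
followup_row() {
  printf '| %s | %s | %s | %s | %s | %s |\n' "$1" "$2" "$3" "$4" "$5" "$6"
}
```

For example, `followup_row 2026-04-14 "Add NFS alert" P1 Alert abc1234 postmortem-todo-resolver` prints `| 2026-04-14 | Add NFS alert | P1 | Alert | abc1234 | postmortem-todo-resolver |`, where the action and `abc1234` are placeholder values. Because the table is the last element in the template, redirecting that output with `>>` to the post-mortem file would extend the table, provided nothing has been added after it.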