E2E workflow for incident post-mortems:

1. /post-mortem skill generates structured post-mortem markdown
2. Woodpecker pipeline triggers on docs/post-mortems/*.md changes
3. parse-postmortem-todos.sh extracts safe TODOs (Alert/Config/Monitor)
4. postmortem-todo-resolver agent implements TODOs headlessly
5. Agent updates post-mortem with Follow-up Implementation table

Components:

- .claude/skills/post-mortem/: writer skill + template
- .claude/agents/postmortem-todo-resolver.md: headless agent
- .woodpecker/postmortem-todos.yml: CI pipeline
- scripts/parse-postmortem-todos.sh: TODO extractor
- cluster-health skill: auto-suggest post-mortem after recovery

Safety: only auto-implements Alert/Config/Monitor types. Architecture/Migration/Investigation items are skipped.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Post-Mortem Writer
Generate a structured post-mortem document after an incident mitigation session.
When to use
- After `/post-mortem` command
- Auto-suggested when cluster health transitions from UNHEALTHY → HEALTHY
Instructions
- Gather context:
  - Run `.claude/scripts/sev-context.sh` to capture current cluster state
  - Review the conversation history for: what broke, timeline, root cause, what was fixed
  - Check existing post-mortems at `docs/post-mortems/` for format reference
- Generate the post-mortem:
  - Use the template at `.claude/skills/post-mortem/template.md`
  - Fill in all sections from the investigation context
  - Critical: In the Prevention Plan tables, set the `Type` column correctly:
    - `Alert`: add/modify Prometheus alerting rules (auto-implementable)
    - `Config`: change Terraform config, NFS options, etc. (auto-implementable)
    - `Monitor`: add Uptime Kuma monitors (auto-implementable)
    - `Architecture`: storage migration, stack redesign (human-only)
    - `Investigation`: needs further research (human-only)
    - `Runbook`: document a procedure (human-only)
    - `Migration`: data or service migration (human-only)
  - Items already fixed during the session should have Status = `Done`
  - Items not yet done should have Status = `TODO`
- File naming: `docs/post-mortems/<YYYY-MM-DD>-<slug>.md`
  - Slug: lowercase, hyphenated, max 5 words describing the incident
- Update index: Add an entry to `docs/post-mortems/index.html`
  - Add a new card in the incidents grid with date, severity tag, title, description
- Commit and push:

      git add docs/post-mortems/<file>.md docs/post-mortems/index.html
      git commit -m "docs: post-mortem for <date> <title> [ci skip]"
      git push origin master

  - Use `[ci skip]` to avoid triggering the app-stacks pipeline
  - NOTE: The postmortem-todos Woodpecker pipeline WILL trigger (it has its own path filter)
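
The file-naming rule in the instructions above (lowercase, hyphenated, at most 5 words) can be sketched as a small shell helper. This is only an illustration, not part of the repository: `make_slug` and `postmortem_path` are hypothetical names, and treating every run of non-alphanumeric characters as a word separator is an assumption.

```shell
#!/bin/sh
# Hypothetical helpers (not in the repository) illustrating the naming rule:
#   docs/post-mortems/<YYYY-MM-DD>-<slug>.md

# make_slug TITLE
# Prints TITLE lowercased, with words joined by hyphens, truncated to 5 words.
# ASSUMPTION: any run of characters outside a-z0-9 counts as a word separator.
make_slug() {
    printf '%s\n' "$1" \
        | tr '[:upper:]' '[:lower:]' \
        | tr -cs 'a-z0-9' ' ' \
        | awk '{
              n = (NF < 5) ? NF : 5          # keep at most 5 words
              out = ""
              for (i = 1; i <= n; i++)
                  out = out (i > 1 ? "-" : "") $i
              print out
          }'
}

# postmortem_path DATE TITLE
# Prints the target path for a post-mortem; DATE is expected as YYYY-MM-DD
# and is not validated here.
postmortem_path() {
    printf 'docs/post-mortems/%s-%s.md\n' "$1" "$(make_slug "$2")"
}
```

For example, with a made-up date and title, `postmortem_path 2025-01-15 "NFS Mount Stalls on Worker Nodes After Reboot"` prints `docs/post-mortems/2025-01-15-nfs-mount-stalls-on-worker.md`. Truncating to the first five words is the simplest reading of "max 5 words"; a human-chosen slug will usually describe the incident better.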
Type Reference for Prevention Plan
| Type | Auto-implementable? | Examples |
|---|---|---|
| Alert | Yes | Add PrometheusRule, modify alert thresholds |
| Config | Yes | Change Terraform variables, mount options, CronJob schedules |
| Monitor | Yes | Add Uptime Kuma HTTP/TCP monitor |
| Architecture | No | Migrate storage class, redesign HA topology |
| Investigation | No | Research kernel bug, check Proxmox forum |
| Runbook | No | Document recovery procedure |
| Migration | No | Move data between storage backends |
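
To illustrate how the type table combines with the `TODO`/`Done` status convention, the filter below keeps only rows that are both auto-implementable and still open. It is a rough sketch and not the repository's `scripts/parse-postmortem-todos.sh`: the function name `filter_auto_todos` is hypothetical, and the column order `| Action | Type | Status |` is an assumption, since the template's actual table layout is not shown here.

```shell
#!/bin/sh
# Hypothetical filter (NOT the real parse-postmortem-todos.sh).
# Reads Markdown table rows on stdin and prints those whose Type is
# Alert, Config, or Monitor and whose Status is TODO.
# ASSUMPTION: each row has the form "| Action | Type | Status |".
filter_auto_todos() {
    awk -F'|' '
        NF >= 5 {
            t = $3; s = $4                      # Type and Status cells
            gsub(/^[ \t]+|[ \t]+$/, "", t)      # strip cell padding
            gsub(/^[ \t]+|[ \t]+$/, "", s)
            if (s == "TODO" && (t == "Alert" || t == "Config" || t == "Monitor"))
                print                           # emit the original row unchanged
        }'
}
```

Under that assumed layout, `filter_auto_todos < docs/post-mortems/<file>.md` would list the rows an automated resolver may act on. Header and separator rows drop out because their Status cell is not `TODO`, and human-only types such as Architecture or Migration are skipped even when they are still open.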