Viktor Barzin 8badb8181a feat: post-mortem automation pipeline

E2E workflow for incident post-mortems:
1. /post-mortem skill generates structured post-mortem markdown
2. Woodpecker pipeline triggers on docs/post-mortems/*.md changes
3. parse-postmortem-todos.sh extracts safe TODOs (Alert/Config/Monitor)
4. postmortem-todo-resolver agent implements TODOs headlessly
5. Agent updates post-mortem with Follow-up Implementation table

Components:
- .claude/skills/post-mortem/ — writer skill + template
- .claude/agents/postmortem-todo-resolver.md — headless agent
- .woodpecker/postmortem-todos.yml — CI pipeline
- scripts/parse-postmortem-todos.sh — TODO extractor
- cluster-health skill — auto-suggest post-mortem after recovery

Safety: only auto-implements Alert/Config/Monitor types.
Architecture/Migration/Investigation items are skipped.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-14 15:34:42 +00:00


Post-Mortem Writer

Generate a structured post-mortem document after an incident mitigation session.

When to use

- When the /post-mortem command is run
- When auto-suggested after cluster health transitions from UNHEALTHY → HEALTHY
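The auto-suggest trigger is an edge detection over a sequence of health states. The source does not show how the cluster-health skill implements it, so the following is only a minimal sketch: the function name `recovery_points` and the idea of a sampled state history are assumptions made for illustration, and only the state names come from the text above.

```python
from typing import Iterable, List


def recovery_points(states: Iterable[str]) -> List[int]:
    """Return the indices at which health flips from UNHEALTHY to HEALTHY."""
    points = []
    previous = None
    for index, state in enumerate(states):
        # Only the UNHEALTHY -> HEALTHY edge counts as a recovery.
        if previous == "UNHEALTHY" and state == "HEALTHY":
            points.append(index)
        previous = state
    return points


# Hypothetical history of sampled health states.
history = ["HEALTHY", "UNHEALTHY", "UNHEALTHY", "HEALTHY", "HEALTHY"]
print(recovery_points(history))  # prints [3]
```

Reacting to the transition rather than to the HEALTHY state itself means a post-mortem would be suggested once per recovery, not on every healthy check.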

Instructions

1. Gather context:
   - Run .claude/scripts/sev-context.sh to capture the current cluster state
   - Review the conversation history for what broke, the timeline, the root cause, and what was fixed
   - Check existing post-mortems in docs/post-mortems/ for format reference
2. Generate the post-mortem:
   - Use the template at .claude/skills/post-mortem/template.md
   - Fill in all sections from the investigation context
   - Critical: in the Prevention Plan tables, set the Type column correctly:
     - Alert: add or modify Prometheus alerting rules (auto-implementable)
     - Config: change Terraform config, NFS options, etc. (auto-implementable)
     - Monitor: add Uptime Kuma monitors (auto-implementable)
     - Architecture: storage migration, stack redesign (human-only)
     - Investigation: needs further research (human-only)
     - Runbook: document a procedure (human-only)
     - Migration: data or service migration (human-only)
   - Items already fixed during the session should have Status = Done
   - Items not yet done should have Status = TODO
3. File naming: docs/post-mortems/<YYYY-MM-DD>-<slug>.md
   - Slug: lowercase, hyphenated, at most 5 words describing the incident
4. Update index: add an entry to docs/post-mortems/index.html
   - Add a new card in the incidents grid with date, severity tag, title, and description
5. Commit and push:
   ```
   git add docs/post-mortems/<file>.md docs/post-mortems/index.html
   git commit -m "docs: post-mortem for <date> <title> [ci skip]"
   git push origin master
   ```
   - Use [ci skip] to avoid triggering the app-stacks pipeline
   - Note: the postmortem-todos Woodpecker pipeline will still trigger (it has its own path filter)
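The file-naming rule can be illustrated with a short sketch. This is not part of the skill itself: `postmortem_path` is a hypothetical helper, the incident title and date are made up, and keeping the first five words is one assumed way to meet the five-word limit (a hand-picked slug may describe the incident better).

```python
import re
from datetime import date

MAX_SLUG_WORDS = 5  # the naming rule allows at most 5 words


def postmortem_path(incident_date: date, title: str) -> str:
    """Build docs/post-mortems/<YYYY-MM-DD>-<slug>.md from a date and a title."""
    # Lowercase the title and keep only alphanumeric runs as words.
    words = re.findall(r"[a-z0-9]+", title.lower())
    if not words:
        raise ValueError("title must contain at least one letter or digit")
    slug = "-".join(words[:MAX_SLUG_WORDS])
    return f"docs/post-mortems/{incident_date.isoformat()}-{slug}.md"


# Hypothetical incident, for illustration only.
print(postmortem_path(date(2026, 4, 14), "NFS mount stall on worker nodes after reboot"))
# prints docs/post-mortems/2026-04-14-nfs-mount-stall-on-worker.md
```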

Type Reference for Prevention Plan

| Type | Auto-implementable? | Examples |
|---|---|---|
| Alert | Yes | Add PrometheusRule, modify alert thresholds |
| Config | Yes | Change Terraform variables, mount options, CronJob schedules |
| Monitor | Yes | Add Uptime Kuma HTTP/TCP monitor |
| Architecture | No | Migrate storage class, redesign HA topology |
| Investigation | No | Research kernel bug, check Proxmox forum |
| Runbook | No | Document recovery procedure |
| Migration | No | Move data between storage backends |
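To show why the Type and Status columns matter downstream, here is a rough sketch of selecting the items an automated resolver could pick up. It is not the actual scripts/parse-postmortem-todos.sh, which is not shown here. The pipe-table layout, the column names `Action`, `Type`, and `Status`, and the sample rows are all assumptions; only the three auto-implementable types and the Done/TODO statuses come from the text above.

```python
from typing import Dict, List

# Types the table above marks as auto-implementable.
AUTO_IMPLEMENTABLE = {"Alert", "Config", "Monitor"}


def parse_table(markdown: str) -> List[Dict[str, str]]:
    """Parse pipe-delimited table rows into dicts keyed by header cell."""
    rows: List[Dict[str, str]] = []
    header = None
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            header = None  # a non-table line ends the current table
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first row of a table is its header
            continue
        if all(set(cell) <= set("-: ") for cell in cells):
            continue  # skip the |---|---| separator row
        rows.append(dict(zip(header, cells)))
    return rows


def safe_todos(markdown: str) -> List[Dict[str, str]]:
    """Keep rows that are still TODO and have an auto-implementable type."""
    return [
        row for row in parse_table(markdown)
        if row.get("Type") in AUTO_IMPLEMENTABLE and row.get("Status") == "TODO"
    ]


# Hypothetical Prevention Plan table; the column layout is assumed.
SAMPLE = """
| Action | Type | Status |
|---|---|---|
| Add alert for NFS latency | Alert | TODO |
| Raise mount timeout | Config | Done |
| Migrate storage class | Architecture | TODO |
| Add HTTP monitor | Monitor | TODO |
"""

print([row["Action"] for row in safe_todos(SAMPLE)])
# prints ['Add alert for NFS latency', 'Add HTTP monitor']
```

A mistyped Type (for example an Architecture item labelled Config) would pass this kind of filter, which is why the instructions stress setting the column correctly.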
