feat: post-mortem automation pipeline

E2E workflow for incident post-mortems:
1. /post-mortem skill generates structured post-mortem markdown
2. Woodpecker pipeline triggers on docs/post-mortems/*.md changes
3. parse-postmortem-todos.sh extracts safe TODOs (Alert/Config/Monitor)
4. postmortem-todo-resolver agent implements TODOs headlessly
5. Agent updates post-mortem with Follow-up Implementation table

Components:
- .claude/skills/post-mortem/ — writer skill + template
- .claude/agents/postmortem-todo-resolver.md — headless agent
- .woodpecker/postmortem-todos.yml — CI pipeline
- scripts/parse-postmortem-todos.sh — TODO extractor
- cluster-health skill — auto-suggest post-mortem after recovery

Safety: only auto-implements Alert/Config/Monitor types.
Architecture/Migration/Investigation items are skipped.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 15:34:42 +00:00
commit 8badb8181a
parent e832581caf
6 changed files with 406 additions and 0 deletions
--- a/.claude/agents/postmortem-todo-resolver.md
+++ b/.claude/agents/postmortem-todo-resolver.md
@@ -0,0 +1,89 @@
+---
+name: postmortem-todo-resolver
+description: Implements safe TODOs from post-mortem Prevention Plans. Triggered by Woodpecker pipeline on new post-mortem commits.
+model: sonnet
+allowedTools:
+  - Read
+  - Edit
+  - Write
+  - Bash
+  - Grep
+  - Glob
+  - Agent
+---
+
+You are the post-mortem TODO resolver. You implement **safe** infrastructure TODOs extracted from post-mortem documents in the ViktorBarzin/infra repository.
+
+## Safety Rules
+
+1. **ONLY implement TODOs with Type: `Alert`, `Config`, or `Monitor`**
+2. **SKIP TODOs with Type: `Architecture`, `Investigation`, `Runbook`, `Migration`** — add them to the Follow-up table as "Needs human review"
+3. **Always run `scripts/tg plan` before apply** — ABORT if plan shows any destroys > 0
+4. **Never modify platform stacks** (vault, dbaas, traefik, authentik, kyverno) without explicit approval
+5. **Max budget**: Stop after 30 minutes on any single TODO, or once total spend reaches $5
+6. **All changes MUST go through Terraform** — never kubectl apply/edit/patch as final state
+
+## Commit Convention
+
+Each TODO fix gets its own commit:
+```
+fix(post-mortem): <action description> [PM-YYYY-MM-DD]
+
+Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>
+```
+
+## Workflow
+
+### For each safe TODO (in priority order P0 → P3):
+
+1. **Read** the relevant Terraform files mentioned in the TODO details
+2. **Implement** the change:
+   - PrometheusRule → edit `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl`
+   - Uptime Kuma monitor → use the uptime-kuma skill
+   - Config changes → edit the relevant stack's `.tf` files
+3. **Test**: `cd` to the stack directory, run `scripts/tg plan`, verify the change is safe
+4. **Apply**: `scripts/tg apply --non-interactive`
+5. **Commit**: `git add` the changed files + state, commit with the convention above
+6. **Record**: Note the commit SHA for the Follow-up table
+
+### After all TODOs processed:
+
+1. **Update the post-mortem file**:
+   - In Prevention Plan tables: change `TODO` → `Done` for implemented items
+   - Append/update the **Follow-up Implementation** section at the bottom with a table:
+
+   ```markdown
+   ## Follow-up Implementation
+
+   | Date | Action | Priority | Type | Commit | Implemented By |
+   |------|--------|----------|------|--------|----------------|
+   | YYYY-MM-DD | <action> | P0 | Config | [`abc1234`](https://github.com/ViktorBarzin/infra/commit/abc1234) | postmortem-todo-resolver |
+   | — | <skipped action> | P1 | Architecture | — | Needs human review |
+   ```
+
+2. **Commit the post-mortem update**:
+   ```
+   git commit -m "docs: update post-mortem follow-up implementation [PM-YYYY-MM-DD] [ci skip]"
+   ```
+
+3. **Push all changes**: `git push origin master`
+
+## Context
+
+- **Infra repo**: `/home/wizard/code/infra`
+- **Terraform stacks**: `stacks/<name>/`
+- **Apply tool**: `scripts/tg apply --non-interactive` (handles state encryption)
+- **Prometheus alerts**: `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl`
+- **Post-mortems**: `docs/post-mortems/`
+- **GitHub repo**: `https://github.com/ViktorBarzin/infra`
+
+## Example
+
+Given a TODO: `| P2 | Add PrometheusRule for NFS mount failures | Alert | kube_pod_container_status_waiting_reason with NFS volume filter | TODO |`
+
+1. Read `prometheus_chart_values.tpl` to find the right alert group
+2. Add the new alert rule in the appropriate group
+3. `cd stacks/monitoring && scripts/tg plan` → verify 0 destroys
+4. `scripts/tg apply --non-interactive`
+5. `git add . && git commit -m "fix(post-mortem): add NFS mount failure PrometheusRule [PM-2026-04-14]"`
+6. Update post-mortem: `TODO` → `Done`, add commit to Follow-up table
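
Safety Rules 1 to 3 and the P0 → P3 ordering in the agent definition above are written as instructions to the model. A wrapper could also enforce them mechanically. The following is a minimal Python sketch of that idea, not part of this commit. The function names and the dictionary shape of a TODO are hypothetical, and it assumes the plan output is captured as plain text (no color codes) and ends with Terraform's usual `Plan: N to add, N to change, N to destroy.` summary line.

```python
import re

# Types the agent may implement without human review (Safety Rules 1 and 2).
SAFE_TYPES = {"Alert", "Config", "Monitor"}

# Terraform summarizes a plan as e.g. "Plan: 1 to add, 0 to change, 0 to destroy."
DESTROY_COUNT = re.compile(r"Plan:.*?(\d+) to destroy")


def plan_is_safe(plan_output: str) -> bool:
    """Return True only when the plan reports zero resources to destroy."""
    if "No changes." in plan_output:
        return True
    match = DESTROY_COUNT.search(plan_output)
    if match is None:
        # No recognizable summary (error, truncated output): fail closed.
        return False
    return int(match.group(1)) == 0


def partition_todos(todos):
    """Split TODOs into (safe, skipped); safe items are ordered P0 -> P3.

    Each TODO is assumed to be a dict with at least "priority" and "type" keys.
    """
    safe = sorted(
        (t for t in todos if t["type"] in SAFE_TYPES),
        key=lambda t: t["priority"],  # "P0" < "P1" < "P2" < "P3" as strings
    )
    skipped = [t for t in todos if t["type"] not in SAFE_TYPES]
    return safe, skipped
```

Failing closed on unrecognized output is a deliberate choice in this sketch: a plan that errors out or prints an unexpected summary is treated the same as one that destroys resources.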
--- a/.claude/skills/cluster-health/SKILL.md
+++ b/.claude/skills/cluster-health/SKILL.md
@@ -295,6 +295,14 @@ The webhook URL is passed as an environment variable from `openclaw_skill_secret
 | kubectl (in pod) | `/tools/kubectl` |
 | terraform (in pod) | `/tools/terraform` |

+## Post-Mortem Auto-Suggest
+
+After running a healthcheck, if the cluster has **recovered from an unhealthy state** (previous run showed FAIL items that are now resolved), suggest writing a post-mortem:
+
+> The cluster has recovered from the previous unhealthy state. Would you like me to write a post-mortem? Run `/post-mortem` to generate one.
+
+This ensures incidents are documented while context is fresh.
+
 ## Notes

 1. This script is designed to run inside the OpenClaw pod where kubectl is pre-configured via the ServiceAccount
--- a/.claude/skills/post-mortem/skill.md
+++ b/.claude/skills/post-mortem/skill.md
@@ -0,0 +1,55 @@
+# Post-Mortem Writer
+
+Generate a structured post-mortem document after an incident mitigation session.
+
+## When to use
+- When the `/post-mortem` command is invoked
+- Auto-suggested when cluster health transitions from UNHEALTHY → HEALTHY
+
+## Instructions
+
+1. **Gather context**:
+   - Run `.claude/scripts/sev-context.sh` to capture current cluster state
+   - Review the conversation history for: what broke, timeline, root cause, what was fixed
+   - Check existing post-mortems at `docs/post-mortems/` for format reference
+
+2. **Generate the post-mortem**:
+   - Use the template at `.claude/skills/post-mortem/template.md`
+   - Fill in all sections from the investigation context
+   - **Critical**: In the Prevention Plan tables, set the `Type` column correctly:
+     - `Alert` — add/modify Prometheus alerting rules (auto-implementable)
+     - `Config` — change Terraform config, NFS options, etc. (auto-implementable)
+     - `Monitor` — add Uptime Kuma monitors (auto-implementable)
+     - `Architecture` — storage migration, stack redesign (human-only)
+     - `Investigation` — needs further research (human-only)
+     - `Runbook` — document a procedure (human-only)
+     - `Migration` — data or service migration (human-only)
+   - Items already fixed during the session should have Status = `Done`
+   - Items not yet done should have Status = `TODO`
+
+3. **File naming**: `docs/post-mortems/<YYYY-MM-DD>-<slug>.md`
+   - Slug: lowercase, hyphenated, max 5 words describing the incident
+
+4. **Update index**: Add an entry to `docs/post-mortems/index.html`
+   - Add a new card in the incidents grid with date, severity tag, title, description
+
+5. **Commit and push**:
+   ```
+   git add docs/post-mortems/<file>.md docs/post-mortems/index.html
+   git commit -m "docs: post-mortem for <date> <title> [ci skip]"
+   git push origin master
+   ```
+   - Use `[ci skip]` to avoid triggering the app-stacks pipeline
+   - NOTE: The postmortem-todos Woodpecker pipeline WILL trigger (it has its own path filter)
+
+## Type Reference for Prevention Plan
+
+| Type | Auto-implementable? | Examples |
+|------|---------------------|----------|
+| Alert | Yes | Add PrometheusRule, modify alert thresholds |
+| Config | Yes | Change Terraform variables, mount options, CronJob schedules |
+| Monitor | Yes | Add Uptime Kuma HTTP/TCP monitor |
+| Architecture | No | Migrate storage class, redesign HA topology |
+| Investigation | No | Research kernel bug, check Proxmox forum |
+| Runbook | No | Document recovery procedure |
+| Migration | No | Move data between storage backends |
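
The skill above fixes two machine-readable conventions: the five-column Prevention Plan row (`Priority | Action | Type | Details | Status`) and the file name `docs/post-mortems/<YYYY-MM-DD>-<slug>.md`. The sketch below shows one way both could be handled in Python. It is illustrative only: the actual extractor, `scripts/parse-postmortem-todos.sh`, is a shell script whose contents are not shown in this diff, and the function names here are hypothetical. It also assumes no cell contains a literal `|`.

```python
import re


def parse_prevention_rows(markdown: str):
    """Extract Prevention Plan rows as dicts, skipping headers and separators."""
    rows = []
    for line in markdown.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Data rows have exactly five cells and start with a priority P0..P3.
        if len(cells) == 5 and re.fullmatch(r"P[0-3]", cells[0]):
            priority, action, type_, details, status = cells
            rows.append({
                "priority": priority,
                "action": action,
                "type": type_,
                "details": details,
                "status": status,
            })
    return rows


def post_mortem_path(date: str, title: str) -> str:
    """Build the file path: lowercase, hyphenated slug of at most five words."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return f"docs/post-mortems/{date}-{'-'.join(words[:5])}.md"
```

Because the Follow-up Implementation table has six columns and begins with a date, the five-cell check keeps it out of the parsed results.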
--- a/.claude/skills/post-mortem/template.md
+++ b/.claude/skills/post-mortem/template.md
@@ -0,0 +1,85 @@
+# Post-Mortem: <TITLE>
+
+| Field | Value |
+|-------|-------|
+| **Date** | <DATE> |
+| **Duration** | <DURATION> |
+| **Severity** | <SEV1/SEV2/SEV3> |
+| **Affected Services** | <COUNT> pods across <COUNT> namespaces |
+| **Status** | Draft |
+
+## Summary
+
+<1-2 sentence summary of the incident.>
+
+## Impact
+
+- **User-facing**: <What users experienced>
+- **Blast radius**: <How many services/pods/namespaces affected>
+- **Duration**: <How long the outage lasted>
+- **Data loss**: <None/details>
+- **Monitoring gap**: <Any blind spots in alerting>
+
+## Timeline (UTC)
+
+| Time | Event |
+|------|-------|
+| **HH:MM** | <First sign of trouble> |
+| **HH:MM** | <Detection / user report> |
+| **HH:MM** | <Investigation begins> |
+| **HH:MM** | <Root cause identified> |
+| **HH:MM** | <Fix applied> |
+| **HH:MM** | <Service restored> |
+
+## Root Cause
+
+<Narrative description of what went wrong and why.>
+
+## Contributing Factors
+
+1. <Factor that made the incident worse or harder to detect>
+2. <Factor...>
+
+## Detection Gaps
+
+| Gap | Impact | Fix |
+|-----|--------|-----|
+| <What wasn't monitored> | <How it delayed detection> | <What to add> |
+
+## Prevention Plan
+
+### P0 — Prevent this exact failure
+
+| Priority | Action | Type | Details | Status |
+|----------|--------|------|---------|--------|
+| P0 | <action> | Config | <details> | TODO |
+
+### P1 — Reduce blast radius
+
+| Priority | Action | Type | Details | Status |
+|----------|--------|------|---------|--------|
+| P1 | <action> | Alert | <details> | TODO |
+
+### P2 — Detect faster
+
+| Priority | Action | Type | Details | Status |
+|----------|--------|------|---------|--------|
+| P2 | <action> | Monitor | <details> | TODO |
+
+### P3 — Improve resilience
+
+| Priority | Action | Type | Details | Status |
+|----------|--------|------|---------|--------|
+| P3 | <action> | Architecture | <details> | TODO |
+
+## Lessons Learned
+
+1. <Key takeaway>
+2. <Key takeaway>
+
+## Follow-up Implementation
+
+_This section is auto-populated by the postmortem-todo-resolver agent._
+
+| Date | Action | Priority | Type | Commit | Implemented By |
+|------|--------|----------|------|--------|----------------|
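
The template's Follow-up Implementation table starts empty and is filled in by the agent, which also flips `TODO` to `Done` in the Prevention Plan tables. A hedged Python sketch of those two edits follows. The row layout is copied from the example table in the agent definition, but the helper names are hypothetical and matching a row by its exact Action text is an assumption of this sketch.

```python
REPO_URL = "https://github.com/ViktorBarzin/infra"
EMPTY = "\u2014"  # dash placeholder used for skipped items in the agent's example table


def follow_up_row(action, priority, type_, date=None, sha=None):
    """Render one Follow-up Implementation row; sha=None means the item was skipped."""
    if sha is None:
        return f"| {EMPTY} | {action} | {priority} | {type_} | {EMPTY} | Needs human review |"
    short = sha[:7]
    link = f"[`{short}`]({REPO_URL}/commit/{short})"
    return f"| {date} | {action} | {priority} | {type_} | {link} | postmortem-todo-resolver |"


def mark_done(markdown: str, action: str) -> str:
    """Change Status from TODO to Done on the Prevention Plan row for `action`."""
    out = []
    for line in markdown.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) == 5 and cells[1] == action and cells[4] == "TODO":
            cells[4] = "Done"
            line = "| " + " | ".join(cells) + " |"
        out.append(line)
    return "\n".join(out)
```

Rows that are rewritten come back with single-space cell padding, which renders identically in Markdown.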