docs: add incident response & post-mortem pipeline architecture

Documents the full E2E automated incident response pipeline: - /post-mortem skill for generating structured post-mortems - Woodpecker pipeline triggered on post-mortem file changes - Claude Code headless agent for implementing safe TODOs - Safety guardrails, commit conventions, secrets, limitations [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 18:31:29 +00:00 · 2026-04-14 18:31:29 +00:00 · f658b18a50
commit f658b18a50
parent cf87a747d8
1 changed files with 182 additions and 0 deletions
--- a/docs/architecture/incident-response.md
+++ b/docs/architecture/incident-response.md
@ -0,0 +1,182 @@
+# Incident Response & Post-Mortem Pipeline
+
+## Overview
+
+Automated incident response pipeline that handles the full lifecycle: detection → mitigation → post-mortem generation → TODO implementation → documentation update. Claude Code agents automate both the post-mortem writing and the follow-up remediation, with human review gates for risky changes.
+
+## Architecture Diagram
+
+```mermaid
+graph TD
+    A[Incident Detected] --> B[Interactive Mitigation<br/>Claude Code CLI]
+    B --> C{Cluster Healthy?}
+    C -->|No| B
+    C -->|Yes| D[/post-mortem skill<br/>generates markdown]
+    D --> E[git push to<br/>docs/post-mortems/*.md]
+    E --> F[GitHub Webhook]
+    F --> G[Woodpecker Pipeline<br/>postmortem-todos.yml]
+    G --> H[parse-postmortem-todos.sh<br/>Extract safe TODOs]
+    H --> I{Safe TODOs?}
+    I -->|None| J[Slack: No auto-implementable TODOs]
+    I -->|Found| K[Vault Auth<br/>K8s SA JWT]
+    K --> L[Fetch SSH Key<br/>secret/ci/infra]
+    L --> M[SSH to DevVM<br/>10.0.10.10]
+    M --> N[Claude Code Headless<br/>postmortem-todo-resolver agent]
+    N --> O[Implement TODOs<br/>Terraform plan/apply]
+    O --> P[Update Post-Mortem<br/>Follow-up Table]
+    P --> Q[git push]
+    Q --> R[GHA: Deploy to<br/>GitHub Pages]
+    Q --> S[Slack Notification]
+
+    style B fill:#6366f1
+    style D fill:#6366f1
+    style G fill:#4c9e47
+    style N fill:#6366f1
+    style R fill:#2088ff
+```
+
+## Components
+
+### 1. Post-Mortem Writer Skill
+
+**Location**: `.claude/skills/post-mortem/`
+
+| File | Purpose |
+|------|---------|
+| `skill.md` | Skill definition — triggered by `/post-mortem` command |
+| `template.md` | Standard post-mortem markdown template |
+
+**When to use**: After mitigating an incident. Auto-suggested when cluster health transitions UNHEALTHY → HEALTHY.
+
+**What it generates**:
+- Standard fields (date, duration, severity, affected services)
+- Timeline from investigation session
+- Root cause chain
+- Prevention Plan with TODO table (Priority, Action, **Type**, Details, Status)
+- Lessons learned
+- Follow-up Implementation table (auto-populated by agent)
+
+**Type column** is critical for automation:
+
+| Type | Auto-implementable? | Examples |
+|------|---------------------|----------|
+| `Alert` | Yes | PrometheusRule, alert thresholds |
+| `Config` | Yes | Terraform config, NFS options |
+| `Monitor` | Yes | Uptime Kuma HTTP/TCP monitor |
+| `Architecture` | No — human review | Storage migration, HA redesign |
+| `Investigation` | No — human review | Research, root cause analysis |
+| `Migration` | No — human review | Data or service migration |
+| `Runbook` | No — human review | Document recovery procedure |
+
+### 2. TODO Parser
+
+**Location**: `scripts/parse-postmortem-todos.sh`
+
+Shell script (POSIX sh + python3) that:
+1. Scans a post-mortem markdown file for TODO items in Prevention Plan tables
+2. Classifies each TODO as safe (Alert/Config/Monitor) or unsafe
+3. Outputs structured JSON:
+
+```json
+{
+  "file": "docs/post-mortems/2026-04-14-example.md",
+  "todos": [{"priority": "P2", "action": "Add NFS alert", "type": "Alert", "details": "...", "safe": true}],
+  "skipped": [{"priority": "P1", "action": "Migrate Vault", "type": "Migration", "details": "...", "safe": false}],
+  "safe_todos": 3,
+  "skipped_todos": 2
+}
+```
+
+Supports both the new template format (`Priority | Action | Type | Details | Status`) and the legacy format (`Action | Status | Details`), inferring types from action text for legacy.
+
+### 3. Woodpecker Pipeline
+
+**Location**: `.woodpecker/postmortem-todos.yml`
+
+**Trigger**: Push to `master` with changes in `docs/post-mortems/*.md`
+
+**Steps**:
+
+1. **parse-and-implement**: Runs `scripts/postmortem-pipeline.sh` which:
+   - Scans all post-mortems for pending TODOs (no git diff — avoids shallow clone issues)
+   - Parses safe TODOs via the parser script
+   - Authenticates to Vault via K8s Service Account JWT
+   - Fetches DevVM SSH key from `secret/ci/infra` → `devvm_ssh_key`
+   - SSHes to DevVM (10.0.10.10) and runs Claude Code headless
+
+2. **notify-slack**: Posts pipeline result to Slack
+
+**Authentication chain**: Woodpecker pod → K8s SA token → Vault K8s auth (role: `ci`) → `secret/data/ci/infra` → SSH key → DevVM
+
+### 4. TODO Resolver Agent
+
+**Location**: `.claude/agents/postmortem-todo-resolver.md`
+
+Claude Code agent that runs in headless mode (`claude -p --agent postmortem-todo-resolver`).
+
+**What it does per TODO** (in priority order P0 → P3):
+1. Reads relevant Terraform files
+2. Implements the change (edit `.tf`, `.tpl`, etc.)
+3. Runs `scripts/tg plan` — aborts if any resources would be destroyed
+4. Runs `scripts/tg apply --non-interactive`
+5. Commits with: `fix(post-mortem): <action> [PM-YYYY-MM-DD]`
+
+**After all TODOs**:
+- Updates the Prevention Plan table: `TODO` → `Done`
+- Populates the **Follow-up Implementation** table:
+
+| Date | Action | Priority | Type | Commit | Implemented By |
+|------|--------|----------|------|--------|----------------|
+| 2026-04-14 | Add NFS RPC retransmission alert | P2 | Alert | [`abc1234`](https://github.com/ViktorBarzin/infra/commit/abc1234) | postmortem-todo-resolver |
+| — | Migrate Vault to encrypted PVC | P1 | Migration | — | Needs human review |
+
+**Safety guardrails**:
+- Only implements Alert, Config, Monitor types
+- Never modifies platform stacks (vault, dbaas, traefik, authentik)
+- Aborts if Terraform plan shows any destroys
+- Budget cap: $5 per run
+- Skipped items marked as "Needs human review"
+
+### 5. Cluster Health Auto-Suggest
+
+**Location**: `.claude/skills/cluster-health/SKILL.md`
+
+After running a healthcheck, if the cluster recovered from a previous unhealthy state, the skill suggests:
+
+> The cluster has recovered. Would you like me to write a post-mortem? Run `/post-mortem` to generate one.
+
+## Secrets & Configuration
+
+| Secret | Vault Path | Purpose |
+|--------|-----------|---------|
+| DevVM SSH key | `secret/ci/infra` → `devvm_ssh_key` | Woodpecker → DevVM SSH access |
+| Slack webhook | Woodpecker global secret `slack_webhook` | Pipeline notifications |
+| Anthropic API key | `~/.claude/` on DevVM | Claude Code headless mode |
+
+## File Inventory
+
+| File | Type | Description |
+|------|------|-------------|
+| `.claude/skills/post-mortem/skill.md` | Skill | Post-mortem writer definition |
+| `.claude/skills/post-mortem/template.md` | Template | Post-mortem markdown skeleton |
+| `.claude/agents/postmortem-todo-resolver.md` | Agent | Headless TODO implementation agent |
+| `.woodpecker/postmortem-todos.yml` | Pipeline | Woodpecker CI triggered on post-mortem changes |
+| `scripts/postmortem-pipeline.sh` | Script | Pipeline orchestration (parse, auth, SSH, invoke) |
+| `scripts/parse-postmortem-todos.sh` | Script | TODO extraction from markdown |
+| `docs/post-mortems/` | Directory | All post-mortem documents |
+| `docs/post-mortems/index.html` | Static | Post-mortem index page (deployed to GH Pages) |
+
+## Commit Conventions
+
+| Pattern | Used by | Example |
+|---------|---------|---------|
+| `fix(post-mortem): <action> [PM-YYYY-MM-DD]` | TODO resolver agent | `fix(post-mortem): add NFS alert [PM-2026-04-14]` |
+| `docs: post-mortem for <date> <title> [ci skip]` | Post-mortem writer skill | `docs: post-mortem for 2026-04-14 NFS outage [ci skip]` |
+| `docs: update post-mortem follow-up [PM-YYYY-MM-DD] [ci skip]` | TODO resolver agent | Final update with Follow-up table |
+
+## Limitations
+
+- **Woodpecker shallow clone**: The pipeline scans all post-mortems for TODOs rather than diffing `HEAD~1` (shallow clone breaks git history)
+- **Single DevVM**: The agent runs on 10.0.10.10 — if DevVM is down, pipeline fails. Could be extended to multiple hosts.
+- **Anthropic API dependency**: Headless Claude Code requires API access. Budget capped at $5 per run.
+- **No interactive approval**: The agent cannot ask for human approval mid-run. Risky items are skipped entirely.