From f658b18a50524adec3dd47255b2014f99bd6dd50 Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Tue, 14 Apr 2026 18:31:29 +0000 Subject: [PATCH] docs: add incident response & post-mortem pipeline architecture Documents the full E2E automated incident response pipeline: - /post-mortem skill for generating structured post-mortems - Woodpecker pipeline triggered on post-mortem file changes - Claude Code headless agent for implementing safe TODOs - Safety guardrails, commit conventions, secrets, limitations [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) --- docs/architecture/incident-response.md | 182 +++++++++++++++++++++++++ 1 file changed, 182 insertions(+) create mode 100644 docs/architecture/incident-response.md diff --git a/docs/architecture/incident-response.md b/docs/architecture/incident-response.md new file mode 100644 index 00000000..ddb11e77 --- /dev/null +++ b/docs/architecture/incident-response.md @@ -0,0 +1,182 @@ +# Incident Response & Post-Mortem Pipeline + +## Overview + +Automated incident response pipeline that handles the full lifecycle: detection → mitigation → post-mortem generation → TODO implementation → documentation update. Claude Code agents automate both the post-mortem writing and the follow-up remediation, with human review gates for risky changes. + +## Architecture Diagram + +```mermaid +graph TD + A[Incident Detected] --> B[Interactive Mitigation
Claude Code CLI] + B --> C{Cluster Healthy?} + C -->|No| B + C -->|Yes| D[/post-mortem skill
generates markdown] + D --> E[git push to
docs/post-mortems/*.md] + E --> F[GitHub Webhook] + F --> G[Woodpecker Pipeline
postmortem-todos.yml] + G --> H[parse-postmortem-todos.sh
Extract safe TODOs] + H --> I{Safe TODOs?} + I -->|None| J[Slack: No auto-implementable TODOs] + I -->|Found| K[Vault Auth
K8s SA JWT] + K --> L[Fetch SSH Key
secret/ci/infra] + L --> M[SSH to DevVM
10.0.10.10] + M --> N[Claude Code Headless
postmortem-todo-resolver agent] + N --> O[Implement TODOs
Terraform plan/apply] + O --> P[Update Post-Mortem
Follow-up Table] + P --> Q[git push] + Q --> R[GHA: Deploy to
GitHub Pages] + Q --> S[Slack Notification] + + style B fill:#6366f1 + style D fill:#6366f1 + style G fill:#4c9e47 + style N fill:#6366f1 + style R fill:#2088ff +``` + +## Components + +### 1. Post-Mortem Writer Skill + +**Location**: `.claude/skills/post-mortem/` + +| File | Purpose | +|------|---------| +| `skill.md` | Skill definition — triggered by `/post-mortem` command | +| `template.md` | Standard post-mortem markdown template | + +**When to use**: After mitigating an incident. Auto-suggested when cluster health transitions UNHEALTHY → HEALTHY. + +**What it generates**: +- Standard fields (date, duration, severity, affected services) +- Timeline from investigation session +- Root cause chain +- Prevention Plan with TODO table (Priority, Action, **Type**, Details, Status) +- Lessons learned +- Follow-up Implementation table (auto-populated by agent) + +**Type column** is critical for automation: + +| Type | Auto-implementable? | Examples | +|------|---------------------|----------| +| `Alert` | Yes | PrometheusRule, alert thresholds | +| `Config` | Yes | Terraform config, NFS options | +| `Monitor` | Yes | Uptime Kuma HTTP/TCP monitor | +| `Architecture` | No — human review | Storage migration, HA redesign | +| `Investigation` | No — human review | Research, root cause analysis | +| `Migration` | No — human review | Data or service migration | +| `Runbook` | No — human review | Document recovery procedure | + +### 2. TODO Parser + +**Location**: `scripts/parse-postmortem-todos.sh` + +Shell script (POSIX sh + python3) that: +1. Scans a post-mortem markdown file for TODO items in Prevention Plan tables +2. Classifies each TODO as safe (Alert/Config/Monitor) or unsafe +3. Outputs structured JSON: + +```json +{ + "file": "docs/post-mortems/2026-04-14-example.md", + "todos": [{"priority": "P2", "action": "Add NFS alert", "type": "Alert", "details": "...", "safe": true}], + "skipped": [{"priority": "P1", "action": "Migrate Vault", "type": "Migration", "details": "...", "safe": false}], + "safe_todos": 3, + "skipped_todos": 2 +} +``` + +Supports both the new template format (`Priority | Action | Type | Details | Status`) and the legacy format (`Action | Status | Details`), inferring types from action text for legacy. + +### 3. Woodpecker Pipeline + +**Location**: `.woodpecker/postmortem-todos.yml` + +**Trigger**: Push to `master` with changes in `docs/post-mortems/*.md` + +**Steps**: + +1. **parse-and-implement**: Runs `scripts/postmortem-pipeline.sh` which: + - Scans all post-mortems for pending TODOs (no git diff — avoids shallow clone issues) + - Parses safe TODOs via the parser script + - Authenticates to Vault via K8s Service Account JWT + - Fetches DevVM SSH key from `secret/ci/infra` → `devvm_ssh_key` + - SSHes to DevVM (10.0.10.10) and runs Claude Code headless + +2. **notify-slack**: Posts pipeline result to Slack + +**Authentication chain**: Woodpecker pod → K8s SA token → Vault K8s auth (role: `ci`) → `secret/data/ci/infra` → SSH key → DevVM + +### 4. TODO Resolver Agent + +**Location**: `.claude/agents/postmortem-todo-resolver.md` + +Claude Code agent that runs in headless mode (`claude -p --agent postmortem-todo-resolver`). + +**What it does per TODO** (in priority order P0 → P3): +1. Reads relevant Terraform files +2. Implements the change (edit `.tf`, `.tpl`, etc.) +3. Runs `scripts/tg plan` — aborts if any resources would be destroyed +4. Runs `scripts/tg apply --non-interactive` +5. Commits with: `fix(post-mortem): [PM-YYYY-MM-DD]` + +**After all TODOs**: +- Updates the Prevention Plan table: `TODO` → `Done` +- Populates the **Follow-up Implementation** table: + +| Date | Action | Priority | Type | Commit | Implemented By | +|------|--------|----------|------|--------|----------------| +| 2026-04-14 | Add NFS RPC retransmission alert | P2 | Alert | [`abc1234`](https://github.com/ViktorBarzin/infra/commit/abc1234) | postmortem-todo-resolver | +| — | Migrate Vault to encrypted PVC | P1 | Migration | — | Needs human review | + +**Safety guardrails**: +- Only implements Alert, Config, Monitor types +- Never modifies platform stacks (vault, dbaas, traefik, authentik) +- Aborts if Terraform plan shows any destroys +- Budget cap: $5 per run +- Skipped items marked as "Needs human review" + +### 5. Cluster Health Auto-Suggest + +**Location**: `.claude/skills/cluster-health/SKILL.md` + +After running a healthcheck, if the cluster recovered from a previous unhealthy state, the skill suggests: + +> The cluster has recovered. Would you like me to write a post-mortem? Run `/post-mortem` to generate one. + +## Secrets & Configuration + +| Secret | Vault Path | Purpose | +|--------|-----------|---------| +| DevVM SSH key | `secret/ci/infra` → `devvm_ssh_key` | Woodpecker → DevVM SSH access | +| Slack webhook | Woodpecker global secret `slack_webhook` | Pipeline notifications | +| Anthropic API key | `~/.claude/` on DevVM | Claude Code headless mode | + +## File Inventory + +| File | Type | Description | +|------|------|-------------| +| `.claude/skills/post-mortem/skill.md` | Skill | Post-mortem writer definition | +| `.claude/skills/post-mortem/template.md` | Template | Post-mortem markdown skeleton | +| `.claude/agents/postmortem-todo-resolver.md` | Agent | Headless TODO implementation agent | +| `.woodpecker/postmortem-todos.yml` | Pipeline | Woodpecker CI triggered on post-mortem changes | +| `scripts/postmortem-pipeline.sh` | Script | Pipeline orchestration (parse, auth, SSH, invoke) | +| `scripts/parse-postmortem-todos.sh` | Script | TODO extraction from markdown | +| `docs/post-mortems/` | Directory | All post-mortem documents | +| `docs/post-mortems/index.html` | Static | Post-mortem index page (deployed to GH Pages) | + +## Commit Conventions + +| Pattern | Used by | Example | +|---------|---------|---------| +| `fix(post-mortem): [PM-YYYY-MM-DD]` | TODO resolver agent | `fix(post-mortem): add NFS alert [PM-2026-04-14]` | +| `docs: post-mortem for [ci skip]` | Post-mortem writer skill | `docs: post-mortem for 2026-04-14 NFS outage [ci skip]` | +| `docs: update post-mortem follow-up [PM-YYYY-MM-DD] [ci skip]` | TODO resolver agent | Final update with Follow-up table | + +## Limitations + +- **Woodpecker shallow clone**: The pipeline scans all post-mortems for TODOs rather than diffing `HEAD~1` (shallow clone breaks git history) +- **Single DevVM**: The agent runs on 10.0.10.10 — if DevVM is down, pipeline fails. Could be extended to multiple hosts. +- **Anthropic API dependency**: Headless Claude Code requires API access. Budget capped at $5 per run. +- **No interactive approval**: The agent cannot ask for human approval mid-run. Risky items are skipped entirely.