# Incident Response & Post-Mortem Pipeline ## Reporting an Issue If something is broken or behaving unexpectedly, here's how to report it: ### Where to report | Channel | When to use | Response time | |---------|-------------|---------------| | **Slack #alerts** | Service down, can't access something | Minutes | | **GitHub Issue** on [ViktorBarzin/infra](https://github.com/ViktorBarzin/infra/issues) | Non-urgent bugs, feature requests, recurring problems | Hours | | **Direct message Viktor** | Emergencies (DNS down, cluster unreachable, data loss risk) | ASAP | ### What to include A good issue report helps us fix things faster. Include: 1. **What's broken** — which service, URL, or feature 2. **When it started** — approximate time (timezone!) 3. **What you see** — error message, screenshot, HTTP status code 4. **What you expected** — what should have happened ### Examples **Good report:** > Nextcloud at nextcloud.viktorbarzin.me returns 502 Bad Gateway since ~14:00 UTC. > Was working fine this morning. Other services (Grafana, Immich) seem fine. **Also good (minimal):** > ha-sofia.viktorbarzin.lan not resolving — getting NXDOMAIN **Not helpful:** > Nothing works ### What happens after you report ``` You report issue │ ▼ Viktor investigates with Claude Code (cluster-health, logs, diagnostics) │ ▼ Fix applied → service restored │ ▼ Post-mortem auto-generated with /post-mortem │ ▼ Post-mortem pushed to repo │ ▼ Automated pipeline implements follow-up fixes (alerts, monitoring, config) │ ▼ Post-mortem updated with implementation links │ ▼ Published at GitHub Pages for review ``` You'll be notified in Slack when: - Your issue is being investigated - The fix is applied - The post-mortem is published (with what was done to prevent recurrence) ### Checking service status - **Uptime dashboard**: [uptime.viktorbarzin.me](https://uptime.viktorbarzin.me) — real-time status of all services - **Post-mortems**: [ViktorBarzin/infra post-mortems](https://github.com/ViktorBarzin/infra/tree/master/docs/post-mortems) — past incidents and their fixes - **Grafana**: [grafana.viktorbarzin.me](https://grafana.viktorbarzin.me) — metrics and dashboards ### Common self-service checks Before reporting, you can check: | Symptom | Quick check | |---------|-------------| | Service returns 502/503 | Is the pod running? Check [K8s Dashboard](https://dashboard.viktorbarzin.me) | | Can't login (SSO) | Try incognito window — might be cached auth | | Slow performance | Check if the node is under memory pressure in Grafana | | DNS not resolving | Try `nslookup 10.0.20.201` — if that works, it's client DNS cache | --- ## Overview Automated incident response pipeline that handles the full lifecycle: detection → mitigation → post-mortem generation → TODO implementation → documentation update. Claude Code agents automate both the post-mortem writing and the follow-up remediation, with human review gates for risky changes. ## Architecture Diagram ```mermaid graph TD A[Incident Detected] --> B[Interactive Mitigation] B --> C{Cluster Healthy?} C -->|No| B C -->|Yes| D[post-mortem skill] D --> E[git push post-mortem] E --> F[GitHub Webhook] F --> G[Woodpecker Pipeline] G --> H[Parse safe TODOs] H --> I{Safe TODOs?} I -->|None| J[Slack: nothing to do] I -->|Found| K[Vault Auth via K8s SA] K --> L[Fetch SSH Key] L --> M[SSH to DevVM] M --> N[Claude Code Headless Agent] N --> O[Terraform plan + apply] O --> P[Update Post-Mortem] P --> Q[git push] Q --> R[GHA: GitHub Pages] Q --> S[Slack Notification] style B fill:#6366f1 style D fill:#6366f1 style G fill:#4c9e47 style N fill:#6366f1 style R fill:#2088ff ``` ## Components ### 1. Post-Mortem Writer Skill **Location**: `.claude/skills/post-mortem/` | File | Purpose | |------|---------| | `skill.md` | Skill definition — triggered by `/post-mortem` command | | `template.md` | Standard post-mortem markdown template | **When to use**: After mitigating an incident. Auto-suggested when cluster health transitions UNHEALTHY → HEALTHY. **What it generates**: - Standard fields (date, duration, severity, affected services) - Timeline from investigation session - Root cause chain - Prevention Plan with TODO table (Priority, Action, **Type**, Details, Status) - Lessons learned - Follow-up Implementation table (auto-populated by agent) **Type column** is critical for automation: | Type | Auto-implementable? | Examples | |------|---------------------|----------| | `Alert` | Yes | PrometheusRule, alert thresholds | | `Config` | Yes | Terraform config, NFS options | | `Monitor` | Yes | Uptime Kuma HTTP/TCP monitor | | `Architecture` | No — human review | Storage migration, HA redesign | | `Investigation` | No — human review | Research, root cause analysis | | `Migration` | No — human review | Data or service migration | | `Runbook` | No — human review | Document recovery procedure | ### 2. TODO Parser **Location**: `scripts/parse-postmortem-todos.sh` Shell script (POSIX sh + python3) that: 1. Scans a post-mortem markdown file for TODO items in Prevention Plan tables 2. Classifies each TODO as safe (Alert/Config/Monitor) or unsafe 3. Outputs structured JSON: ```json { "file": "docs/post-mortems/2026-04-14-example.md", "todos": [{"priority": "P2", "action": "Add NFS alert", "type": "Alert", "details": "...", "safe": true}], "skipped": [{"priority": "P1", "action": "Migrate Vault", "type": "Migration", "details": "...", "safe": false}], "safe_todos": 3, "skipped_todos": 2 } ``` Supports both the new template format (`Priority | Action | Type | Details | Status`) and the legacy format (`Action | Status | Details`), inferring types from action text for legacy. ### 3. Woodpecker Pipeline **Location**: `.woodpecker/postmortem-todos.yml` **Trigger**: Push to `master` with changes in `docs/post-mortems/*.md` **Steps**: 1. **parse-and-implement**: Runs `scripts/postmortem-pipeline.sh` which: - Scans all post-mortems for pending TODOs (no git diff — avoids shallow clone issues) - Parses safe TODOs via the parser script - Authenticates to Vault via K8s Service Account JWT - Fetches DevVM SSH key from `secret/ci/infra` → `devvm_ssh_key` - SSHes to DevVM (10.0.10.10) and runs Claude Code headless 2. **notify-slack**: Posts pipeline result to Slack **Authentication chain**: Woodpecker pod → K8s SA token → Vault K8s auth (role: `ci`) → `secret/data/ci/infra` → SSH key → DevVM ### 4. TODO Resolver Agent **Location**: `.claude/agents/postmortem-todo-resolver.md` Claude Code agent that runs in headless mode (`claude -p --agent postmortem-todo-resolver`). **What it does per TODO** (in priority order P0 → P3): 1. Reads relevant Terraform files 2. Implements the change (edit `.tf`, `.tpl`, etc.) 3. Runs `scripts/tg plan` — aborts if any resources would be destroyed 4. Runs `scripts/tg apply --non-interactive` 5. Commits with: `fix(post-mortem): [PM-YYYY-MM-DD]` **After all TODOs**: - Updates the Prevention Plan table: `TODO` → `Done` - Populates the **Follow-up Implementation** table: | Date | Action | Priority | Type | Commit | Implemented By | |------|--------|----------|------|--------|----------------| | 2026-04-14 | Add NFS RPC retransmission alert | P2 | Alert | [`abc1234`](https://github.com/ViktorBarzin/infra/commit/abc1234) | postmortem-todo-resolver | | — | Migrate Vault to encrypted PVC | P1 | Migration | — | Needs human review | **Safety guardrails**: - Only implements Alert, Config, Monitor types - Never modifies platform stacks (vault, dbaas, traefik, authentik) - Aborts if Terraform plan shows any destroys - Budget cap: $5 per run - Skipped items marked as "Needs human review" ### 5. Cluster Health Auto-Suggest **Location**: `.claude/skills/cluster-health/SKILL.md` After running a healthcheck, if the cluster recovered from a previous unhealthy state, the skill suggests: > The cluster has recovered. Would you like me to write a post-mortem? Run `/post-mortem` to generate one. ## Secrets & Configuration | Secret | Vault Path | Purpose | |--------|-----------|---------| | DevVM SSH key | `secret/ci/infra` → `devvm_ssh_key` | Woodpecker → DevVM SSH access | | Slack webhook | Woodpecker global secret `slack_webhook` | Pipeline notifications | | Anthropic API key | `~/.claude/` on DevVM | Claude Code headless mode | ## File Inventory | File | Type | Description | |------|------|-------------| | `.claude/skills/post-mortem/skill.md` | Skill | Post-mortem writer definition | | `.claude/skills/post-mortem/template.md` | Template | Post-mortem markdown skeleton | | `.claude/agents/postmortem-todo-resolver.md` | Agent | Headless TODO implementation agent | | `.woodpecker/postmortem-todos.yml` | Pipeline | Woodpecker CI triggered on post-mortem changes | | `scripts/postmortem-pipeline.sh` | Script | Pipeline orchestration (parse, auth, SSH, invoke) | | `scripts/parse-postmortem-todos.sh` | Script | TODO extraction from markdown | | `docs/post-mortems/` | Directory | All post-mortem documents | | `docs/post-mortems/index.html` | Static | Post-mortem index page (deployed to GH Pages) | ## Commit Conventions | Pattern | Used by | Example | |---------|---------|---------| | `fix(post-mortem): [PM-YYYY-MM-DD]` | TODO resolver agent | `fix(post-mortem): add NFS alert [PM-2026-04-14]` | | `docs: post-mortem for [ci skip]` | Post-mortem writer skill | `docs: post-mortem for 2026-04-14 NFS outage [ci skip]` | | `docs: update post-mortem follow-up [PM-YYYY-MM-DD] [ci skip]` | TODO resolver agent | Final update with Follow-up table | ## Limitations - **Woodpecker shallow clone**: The pipeline scans all post-mortems for TODOs rather than diffing `HEAD~1` (shallow clone breaks git history) - **Single DevVM**: The agent runs on 10.0.10.10 — if DevVM is down, pipeline fails. Could be extended to multiple hosts. - **Anthropic API dependency**: Headless Claude Code requires API access. Budget capped at $5 per run. - **No interactive approval**: The agent cannot ask for human approval mid-run. Risky items are skipped entirely.