infra/docs/architecture/incident-response.md
Viktor Barzin 30c3450c61 docs: add user-facing issue reporting guide
Adds "Reporting an Issue" section with:
- Where to report (Slack, GitHub, DM)
- What to include (examples of good vs bad reports)
- What happens after reporting (flow diagram)
- Self-service status checks (Uptime Kuma, Grafana, K8s Dashboard)

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 18:37:37 +00:00


Incident Response & Post-Mortem Pipeline

Reporting an Issue

If something is broken or behaving unexpectedly, here's how to report it:

Where to report

| Channel | When to use | Response time |
|---------|-------------|---------------|
| Slack #alerts | Service down, can't access something | Minutes |
| GitHub Issue on ViktorBarzin/infra | Non-urgent bugs, feature requests, recurring problems | Hours |
| Direct message Viktor | Emergencies (DNS down, cluster unreachable, data loss risk) | ASAP |

What to include

A good issue report helps us fix things faster. Include:

  1. What's broken — which service, URL, or feature
  2. When it started — approximate time (timezone!)
  3. What you see — error message, screenshot, HTTP status code
  4. What you expected — what should have happened

Examples

Good report:

Nextcloud at nextcloud.viktorbarzin.me returns 502 Bad Gateway since ~14:00 UTC. Was working fine this morning. Other services (Grafana, Immich) seem fine.

Also good (minimal):

ha-sofia.viktorbarzin.lan not resolving — getting NXDOMAIN

Not helpful:

Nothing works

What happens after you report

You report issue
    │
    ▼
Viktor investigates with Claude Code (cluster-health, logs, diagnostics)
    │
    ▼
Fix applied → service restored
    │
    ▼
Post-mortem auto-generated with /post-mortem
    │
    ▼
Post-mortem pushed to repo
    │
    ▼
Automated pipeline implements follow-up fixes (alerts, monitoring, config)
    │
    ▼
Post-mortem updated with implementation links
    │
    ▼
Published to GitHub Pages for review

You'll be notified in Slack when:

  • Your issue is being investigated
  • The fix is applied
  • The post-mortem is published (with what was done to prevent recurrence)

Checking service status

Common self-service checks

Before reporting, you can check:

| Symptom | Quick check |
|---------|-------------|
| Service returns 502/503 | Is the pod running? Check the K8s Dashboard |
| Can't log in (SSO) | Try an incognito window; it might be cached auth |
| Slow performance | Check in Grafana whether the node is under memory pressure |
| DNS not resolving | Try nslookup <domain> 10.0.20.201; if that works, the problem is the client DNS cache |

Overview

Automated incident response pipeline that handles the full lifecycle: detection → mitigation → post-mortem generation → TODO implementation → documentation update. Claude Code agents automate both the post-mortem writing and the follow-up remediation, with human review gates for risky changes.

Architecture Diagram

graph TD
    A[Incident Detected] --> B[Interactive Mitigation]
    B --> C{Cluster Healthy?}
    C -->|No| B
    C -->|Yes| D[post-mortem skill]
    D --> E[git push post-mortem]
    E --> F[GitHub Webhook]
    F --> G[Woodpecker Pipeline]
    G --> H[Parse safe TODOs]
    H --> I{Safe TODOs?}
    I -->|None| J[Slack: nothing to do]
    I -->|Found| K[Vault Auth via K8s SA]
    K --> L[Fetch SSH Key]
    L --> M[SSH to DevVM]
    M --> N[Claude Code Headless Agent]
    N --> O[Terraform plan + apply]
    O --> P[Update Post-Mortem]
    P --> Q[git push]
    Q --> R[GHA: GitHub Pages]
    Q --> S[Slack Notification]

    style B fill:#6366f1
    style D fill:#6366f1
    style G fill:#4c9e47
    style N fill:#6366f1
    style R fill:#2088ff

Components

1. Post-Mortem Writer Skill

Location: .claude/skills/post-mortem/

| File | Purpose |
|------|---------|
| skill.md | Skill definition, triggered by the /post-mortem command |
| template.md | Standard post-mortem markdown template |

When to use: After mitigating an incident. Auto-suggested when cluster health transitions UNHEALTHY → HEALTHY.

What it generates:

  • Standard fields (date, duration, severity, affected services)
  • Timeline from investigation session
  • Root cause chain
  • Prevention Plan with TODO table (Priority, Action, Type, Details, Status)
  • Lessons learned
  • Follow-up Implementation table (auto-populated by agent)

Type column is critical for automation:

| Type | Auto-implementable? | Examples |
|------|---------------------|----------|
| Alert | Yes | PrometheusRule, alert thresholds |
| Config | Yes | Terraform config, NFS options |
| Monitor | Yes | Uptime Kuma HTTP/TCP monitor |
| Architecture | No (human review) | Storage migration, HA redesign |
| Investigation | No (human review) | Research, root cause analysis |
| Migration | No (human review) | Data or service migration |
| Runbook | No (human review) | Document recovery procedure |

2. TODO Parser

Location: scripts/parse-postmortem-todos.sh

Shell script (POSIX sh + python3) that:

  1. Scans a post-mortem markdown file for TODO items in Prevention Plan tables
  2. Classifies each TODO as safe (Alert/Config/Monitor) or unsafe
  3. Outputs structured JSON:
{
  "file": "docs/post-mortems/2026-04-14-example.md",
  "todos": [{"priority": "P2", "action": "Add NFS alert", "type": "Alert", "details": "...", "safe": true}],
  "skipped": [{"priority": "P1", "action": "Migrate Vault", "type": "Migration", "details": "...", "safe": false}],
  "safe_todos": 3,
  "skipped_todos": 2
}

Supports both the new template format (Priority | Action | Type | Details | Status) and the legacy format (Action | Status | Details), inferring types from action text for legacy.
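
The parsing and classification described above can be sketched in Python. This is a minimal illustration, not the contents of scripts/parse-postmortem-todos.sh: the function names, the `LEGACY_HINTS` keyword map, and the fallback type for unrecognised legacy rows are all assumptions. It also considers any table with Action and Status columns rather than locating the Prevention Plan heading first.

```python
SAFE_TYPES = {"Alert", "Config", "Monitor"}

# Hypothetical keyword hints for legacy rows that have no Type column.
LEGACY_HINTS = {"alert": "Alert", "monitor": "Monitor", "config": "Config"}


def split_row(line):
    """Split a markdown table row into stripped cell values."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def infer_type(action):
    """Guess a type from legacy action text; unknown text stays unsafe."""
    lowered = action.lower()
    for keyword, todo_type in LEGACY_HINTS.items():
        if keyword in lowered:
            return todo_type
    return "Investigation"


def parse_todos(markdown, file="(unknown)"):
    """Collect rows whose Status is TODO and split them into safe and skipped."""
    todos, skipped, header = [], [], None
    for line in markdown.splitlines():
        if not line.lstrip().startswith("|"):
            header = None  # a non-table line ends the current table
            continue
        cells = split_row(line)
        if header is None:
            header = [c.lower() for c in cells]  # first row of a table
            continue
        if all(set(c) <= set("-: ") for c in cells):
            continue  # separator row such as |---|---|
        row = dict(zip(header, cells))
        if "action" not in row or row.get("status", "").upper() != "TODO":
            continue
        todo_type = row.get("type") or infer_type(row["action"])
        item = {
            "priority": row.get("priority", ""),
            "action": row["action"],
            "type": todo_type,
            "details": row.get("details", ""),
            "safe": todo_type in SAFE_TYPES,
        }
        (todos if item["safe"] else skipped).append(item)
    return {
        "file": file,
        "todos": todos,
        "skipped": skipped,
        "safe_todos": len(todos),
        "skipped_todos": len(skipped),
    }
```

Passing the result to `json.dumps` yields output in the shape shown above. Defaulting unknown legacy rows to an unsafe type is a deliberate choice in this sketch: a row the parser cannot classify is skipped rather than implemented.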

3. Woodpecker Pipeline

Location: .woodpecker/postmortem-todos.yml

Trigger: Push to master with changes in docs/post-mortems/*.md

Steps:

  1. parse-and-implement: Runs scripts/postmortem-pipeline.sh which:

    • Scans all post-mortems for pending TODOs (no git diff — avoids shallow clone issues)
    • Parses safe TODOs via the parser script
    • Authenticates to Vault via K8s Service Account JWT
    • Fetches the DevVM SSH key (field devvm_ssh_key) from secret/ci/infra
    • SSHes to DevVM (10.0.10.10) and runs Claude Code headless
  2. notify-slack: Posts pipeline result to Slack
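
The "scan everything" approach in step 1 can be illustrated with a short Python sketch. It assumes a pending item is any table row with a cell that reads exactly TODO; the function name is hypothetical and the real scripts/postmortem-pipeline.sh may decide this differently.

```python
from pathlib import Path


def files_with_pending_todos(root="docs/post-mortems"):
    """Return post-mortem files that still contain a table row marked TODO."""
    pending = []
    for path in sorted(Path(root).glob("*.md")):
        for line in path.read_text(encoding="utf-8").splitlines():
            if not line.lstrip().startswith("|"):
                continue  # only table rows can carry a Status cell
            cells = [c.strip() for c in line.strip().strip("|").split("|")]
            if "TODO" in cells:
                pending.append(path)
                break  # one pending row is enough to select the file
    return pending
```

Because the scan looks only at file contents, it gives the same answer on a shallow clone as on a full one, at the cost of re-reading every post-mortem on each run.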

Authentication chain: Woodpecker pod → K8s SA token → Vault K8s auth (role: ci) → secret/data/ci/infra → SSH key → DevVM
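
The first two hops of this chain map onto two Vault HTTP calls: a Kubernetes auth login and a KV v2 read. The sketch below builds those requests with the standard library. The Vault address, the default `kubernetes` auth mount, and the function names are assumptions; the secret field name is left to the caller.

```python
import json
import urllib.request

# Assumptions: placeholder Vault address and the default "kubernetes" auth mount.
VAULT_ADDR = "https://vault.example.internal:8200"
SA_TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"


def login_request(jwt, role="ci"):
    """Build the login call that exchanges the SA JWT for a Vault token."""
    body = json.dumps({"role": role, "jwt": jwt}).encode()
    return urllib.request.Request(
        f"{VAULT_ADDR}/v1/auth/kubernetes/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def secret_request(vault_token, path="secret/data/ci/infra"):
    """Build the KV v2 read call for the CI secret."""
    return urllib.request.Request(
        f"{VAULT_ADDR}/v1/{path}",
        headers={"X-Vault-Token": vault_token},
        method="GET",
    )


def fetch_secret_field(field):
    """Run both calls end to end (needs network access to Vault)."""
    with open(SA_TOKEN_PATH, encoding="utf-8") as fh:
        jwt = fh.read().strip()
    with urllib.request.urlopen(login_request(jwt)) as resp:
        token = json.load(resp)["auth"]["client_token"]
    with urllib.request.urlopen(secret_request(token)) as resp:
        # KV v2 nests the stored key/value pairs under data.data.
        return json.load(resp)["data"]["data"][field]
```

The `data/` segment in the path is what the KV v2 HTTP API expects, which is why the chain shows secret/data/ci/infra while CLI-style references use secret/ci/infra.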

4. TODO Resolver Agent

Location: .claude/agents/postmortem-todo-resolver.md

Claude Code agent that runs in headless mode (claude -p --agent postmortem-todo-resolver).

What it does per TODO (in priority order P0 → P3):

  1. Reads relevant Terraform files
  2. Implements the change (edit .tf, .tpl, etc.)
  3. Runs scripts/tg plan — aborts if any resources would be destroyed
  4. Runs scripts/tg apply --non-interactive
  5. Commits with: fix(post-mortem): <action> [PM-YYYY-MM-DD]
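
The destroy check in step 3 can be approximated by reading Terraform's plan summary line. This sketch assumes scripts/tg passes that summary through unchanged and without colour codes; the function names are hypothetical.

```python
import re

# Matches e.g. "Plan: 1 to add, 2 to change, 0 to destroy."
DESTROY_RE = re.compile(r"Plan: .*?(\d+) to destroy")


def destroys_in_plan(plan_output):
    """Return the destroy count from the plan summary, or None if none is found."""
    if "No changes." in plan_output:
        return 0
    match = DESTROY_RE.search(plan_output)
    return int(match.group(1)) if match else None


def safe_to_apply(plan_output):
    """Allow apply only when a summary is present and reports zero destroys."""
    return destroys_in_plan(plan_output) == 0
```

Treating a missing summary as unsafe means a failed or truncated plan also blocks the apply, not only a plan that destroys resources. Resource replacements count as a destroy plus an add in Terraform's summary, so they are blocked as well.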

After all TODOs:

  • Updates the Prevention Plan table: TODO → Done
  • Populates the Follow-up Implementation table:
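| Date | Action | Priority | Type | Commit | Implemented By |
|------|--------|----------|------|--------|----------------|
| 2026-04-14 | Add NFS RPC retransmission alert | P2 | Alert | abc1234 | postmortem-todo-resolver |
| | Migrate Vault to encrypted PVC | P1 | Migration | | Needs human review |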

Safety guardrails:

  • Only implements Alert, Config, Monitor types
  • Never modifies platform stacks (vault, dbaas, traefik, authentik)
  • Aborts if Terraform plan shows any destroys
  • Budget cap: $5 per run
  • Skipped items marked as "Needs human review"
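
Read as a pre-flight check, the type, platform-stack, and budget guardrails could look like the Python sketch below. How the agent actually detects that a change touches a platform stack is not described here, so matching stack names against path components is an assumption, as are the function name and the way spend is tracked.

```python
SAFE_TYPES = {"Alert", "Config", "Monitor"}
PLATFORM_STACKS = {"vault", "dbaas", "traefik", "authentik"}
BUDGET_CAP_USD = 5.0


def rejection_reason(todo_type, touched_paths, spent_usd):
    """Return why a TODO must be skipped, or None if it may proceed."""
    if todo_type not in SAFE_TYPES:
        return "Needs human review"
    for path in touched_paths:
        # Hypothetical layout: a stack is identified by a directory name in the path.
        if PLATFORM_STACKS & set(path.split("/")):
            return f"touches platform stack: {path}"
    if spent_usd >= BUDGET_CAP_USD:
        return "budget cap reached"
    return None
```

For example, `rejection_reason("Config", ["stacks/vault/main.tf"], 0.0)` returns a platform-stack reason even though Config is an auto-implementable type.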

5. Cluster Health Auto-Suggest

Location: .claude/skills/cluster-health/SKILL.md

After running a healthcheck, if the cluster recovered from a previous unhealthy state, the skill suggests:

The cluster has recovered. Would you like me to write a post-mortem? Run /post-mortem to generate one.
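
The trigger condition is a simple state transition. A minimal sketch, assuming the skill can compare the previous healthcheck result with the current one (how the previous state is remembered is not specified, and the function name is hypothetical):

```python
SUGGESTION = (
    "The cluster has recovered. Would you like me to write a post-mortem? "
    "Run /post-mortem to generate one."
)


def suggestion_after_check(previous, current):
    """Return the prompt only on an UNHEALTHY -> HEALTHY transition."""
    if previous == "UNHEALTHY" and current == "HEALTHY":
        return SUGGESTION
    return None
```

A cluster that was healthy before and after the check produces no suggestion, so routine healthchecks stay quiet.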

Secrets & Configuration

| Secret | Location | Purpose |
|--------|----------|---------|
| DevVM SSH key | Vault: secret/ci/infra, field devvm_ssh_key | Woodpecker → DevVM SSH access |
| Slack webhook | Woodpecker global secret slack_webhook | Pipeline notifications |
| Anthropic API key | ~/.claude/ on DevVM | Claude Code headless mode |

File Inventory

| File | Type | Description |
|------|------|-------------|
| .claude/skills/post-mortem/skill.md | Skill | Post-mortem writer definition |
| .claude/skills/post-mortem/template.md | Template | Post-mortem markdown skeleton |
| .claude/agents/postmortem-todo-resolver.md | Agent | Headless TODO implementation agent |
| .woodpecker/postmortem-todos.yml | Pipeline | Woodpecker CI triggered on post-mortem changes |
| scripts/postmortem-pipeline.sh | Script | Pipeline orchestration (parse, auth, SSH, invoke) |
| scripts/parse-postmortem-todos.sh | Script | TODO extraction from markdown |
| docs/post-mortems/ | Directory | All post-mortem documents |
| docs/post-mortems/index.html | Static | Post-mortem index page (deployed to GH Pages) |

Commit Conventions

| Pattern | Used by | Example |
|---------|---------|---------|
| fix(post-mortem): <action> [PM-YYYY-MM-DD] | TODO resolver agent | fix(post-mortem): add NFS alert [PM-2026-04-14] |
| docs: post-mortem for <date> <title> [ci skip] | Post-mortem writer skill | docs: post-mortem for 2026-04-14 NFS outage [ci skip] |
| docs: update post-mortem follow-up [PM-YYYY-MM-DD] [ci skip] | TODO resolver agent | Final update with Follow-up table |
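
The two resolver patterns are regular enough to generate and validate mechanically. The helpers below are a sketch with hypothetical names and are not part of the repository:

```python
import re
from datetime import date

# Validates the "fix(post-mortem): <action> [PM-YYYY-MM-DD]" pattern.
RESOLVER_RE = re.compile(r"^fix\(post-mortem\): .+ \[PM-\d{4}-\d{2}-\d{2}\]$")


def resolver_commit_message(action, incident_date):
    """Format the per-TODO commit message used by the resolver agent."""
    return f"fix(post-mortem): {action} [PM-{incident_date.isoformat()}]"


def followup_commit_message(incident_date):
    """Format the final follow-up commit, which carries [ci skip]."""
    return (
        "docs: update post-mortem follow-up "
        f"[PM-{incident_date.isoformat()}] [ci skip]"
    )
```

For the example row above, `resolver_commit_message("add NFS alert", date(2026, 4, 14))` produces the message shown in the Example column.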

Limitations

  • Woodpecker shallow clone: The pipeline scans all post-mortems for TODOs rather than diffing against HEAD~1, because a shallow clone does not carry the history needed for that diff
  • Single DevVM: The agent runs on 10.0.10.10 — if DevVM is down, pipeline fails. Could be extended to multiple hosts.
  • Anthropic API dependency: Headless Claude Code requires API access. Budget capped at $5 per run.
  • No interactive approval: The agent cannot ask for human approval mid-run. Risky items are skipped entirely.