Incident Response & Post-Mortem Pipeline
Reporting an Issue
If something is broken or behaving unexpectedly, here's how to report it:
Where to report
| Channel | When to use | Response time |
|---|---|---|
| Slack #alerts | Service down, can't access something | Minutes |
| GitHub Issue on ViktorBarzin/infra | Non-urgent bugs, feature requests, recurring problems | Hours |
| Direct message Viktor | Emergencies (DNS down, cluster unreachable, data loss risk) | ASAP |
What to include
A good issue report helps us fix things faster. Include:
- What's broken — which service, URL, or feature
- When it started — approximate time (timezone!)
- What you see — error message, screenshot, HTTP status code
- What you expected — what should have happened
Examples
Good report:
Nextcloud at nextcloud.viktorbarzin.me returns 502 Bad Gateway since ~14:00 UTC. Was working fine this morning. Other services (Grafana, Immich) seem fine.
Also good (minimal):
ha-sofia.viktorbarzin.lan not resolving — getting NXDOMAIN
Not helpful:
Nothing works
What happens after you report
You report issue
│
▼
Viktor investigates with Claude Code (cluster-health, logs, diagnostics)
│
▼
Fix applied → service restored
│
▼
Post-mortem auto-generated with /post-mortem
│
▼
Post-mortem pushed to repo
│
▼
Automated pipeline implements follow-up fixes (alerts, monitoring, config)
│
▼
Post-mortem updated with implementation links
│
▼
Published at GitHub Pages for review
You'll be notified in Slack when:
- Your issue is being investigated
- The fix is applied
- The post-mortem is published (with what was done to prevent recurrence)
Checking service status
- Uptime dashboard: uptime.viktorbarzin.me — real-time status of all services
- Post-mortems: ViktorBarzin/infra post-mortems — past incidents and their fixes
- Grafana: grafana.viktorbarzin.me — metrics and dashboards
Common self-service checks
Before reporting, you can check:
| Symptom | Quick check |
|---|---|
| Service returns 502/503 | Is the pod running? Check K8s Dashboard |
| Can't login (SSO) | Try incognito window — might be cached auth |
| Slow performance | Check if the node is under memory pressure in Grafana |
| DNS not resolving | Try nslookup <domain> 10.0.20.201; if that resolves, cluster DNS is fine and the problem is your client's DNS cache |
Overview
Automated incident response pipeline that handles the full lifecycle: detection → mitigation → post-mortem generation → TODO implementation → documentation update. Claude Code agents automate both the post-mortem writing and the follow-up remediation, with human review gates for risky changes.
Architecture Diagram
graph TD
A[Incident Detected] --> B[Interactive Mitigation]
B --> C{Cluster Healthy?}
C -->|No| B
C -->|Yes| D[post-mortem skill]
D --> E[git push post-mortem]
E --> F[GitHub Webhook]
F --> G[Woodpecker Pipeline]
G --> H[Parse safe TODOs]
H --> I{Safe TODOs?}
I -->|None| J[Slack: nothing to do]
I -->|Found| K[Vault Auth via K8s SA]
K --> L[Fetch SSH Key]
L --> M[SSH to DevVM]
M --> N[Claude Code Headless Agent]
N --> O[Terraform plan + apply]
O --> P[Update Post-Mortem]
P --> Q[git push]
Q --> R[GHA: GitHub Pages]
Q --> S[Slack Notification]
style B fill:#6366f1
style D fill:#6366f1
style G fill:#4c9e47
style N fill:#6366f1
style R fill:#2088ff
Components
1. Post-Mortem Writer Skill
Location: .claude/skills/post-mortem/
| File | Purpose |
|---|---|
| `skill.md` | Skill definition, triggered by the `/post-mortem` command |
| `template.md` | Standard post-mortem markdown template |
When to use: After mitigating an incident. Auto-suggested when cluster health transitions UNHEALTHY → HEALTHY.
What it generates:
- Standard fields (date, duration, severity, affected services)
- Timeline from investigation session
- Root cause chain
- Prevention Plan with TODO table (Priority, Action, Type, Details, Status)
- Lessons learned
- Follow-up Implementation table (auto-populated by agent)
Type column is critical for automation:
| Type | Auto-implementable? | Examples |
|---|---|---|
| `Alert` | Yes | PrometheusRule, alert thresholds |
| `Config` | Yes | Terraform config, NFS options |
| `Monitor` | Yes | Uptime Kuma HTTP/TCP monitor |
| `Architecture` | No (human review) | Storage migration, HA redesign |
| `Investigation` | No (human review) | Research, root cause analysis |
| `Migration` | No (human review) | Data or service migration |
| `Runbook` | No (human review) | Document recovery procedure |
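The classification in the table above can be sketched as a small allow-list check. This is a minimal illustration, not the pipeline's actual code: the function name `is_auto_implementable` is hypothetical, and normalising case and whitespace is an assumption. Unknown types fall through to "not safe", so a typo in the Type column never makes a TODO auto-implementable.

```python
# Types the table above lists as auto-implementable.
SAFE_TYPES = {"Alert", "Config", "Monitor"}


def is_auto_implementable(todo_type: str) -> bool:
    """Return True only for types on the allow-list (hypothetical helper)."""
    # Trimming and case-normalising the cell text is an assumption,
    # not documented behaviour of the real parser.
    return todo_type.strip().capitalize() in SAFE_TYPES
```

For example, `is_auto_implementable("Alert")` is true, while `is_auto_implementable("Migration")` and any unrecognised value are false.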
2. TODO Parser
Location: scripts/parse-postmortem-todos.sh
Shell script (POSIX sh + python3) that:
- Scans a post-mortem markdown file for TODO items in Prevention Plan tables
- Classifies each TODO as safe (Alert/Config/Monitor) or unsafe
- Outputs structured JSON:
{
"file": "docs/post-mortems/2026-04-14-example.md",
"todos": [{"priority": "P2", "action": "Add NFS alert", "type": "Alert", "details": "...", "safe": true}],
"skipped": [{"priority": "P1", "action": "Migrate Vault", "type": "Migration", "details": "...", "safe": false}],
"safe_todos": 3,
"skipped_todos": 2
}
Supports both the new template format (Priority | Action | Type | Details | Status) and the legacy format (Action | Status | Details), inferring types from action text for legacy.
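A rough Python sketch of the parsing described above, under stated assumptions: the real script is shell plus python3 and may work differently. The keyword hints in `LEGACY_HINTS`, the `Investigation` fallback for unrecognised legacy actions, and the rule that only rows with status `TODO` are pending are all illustrative guesses.

```python
import json

SAFE_TYPES = {"Alert", "Config", "Monitor"}

# Hypothetical keyword hints for legacy rows, which have no Type column.
LEGACY_HINTS = {"alert": "Alert", "monitor": "Monitor", "config": "Config"}


def split_row(line: str) -> list:
    """Split a markdown table row into trimmed cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def infer_type(action: str) -> str:
    """Guess a type from the action text; unknown actions stay unsafe."""
    lowered = action.lower()
    for keyword, todo_type in LEGACY_HINTS.items():
        if keyword in lowered:
            return todo_type
    return "Investigation"


def parse_todos(table: str) -> dict:
    """Parse one Prevention Plan table into safe and skipped TODO lists."""
    rows = [split_row(line) for line in table.splitlines()
            if line.strip().startswith("|")]
    todos, skipped = [], []
    if len(rows) >= 2:
        header = [name.lower() for name in rows[0]]
        for cells in rows[2:]:  # skip the header and the |---| separator
            row = dict(zip(header, cells))
            if row.get("status", "").upper() != "TODO":
                continue  # only pending items are of interest
            action = row.get("action", "")
            todo_type = row.get("type") or infer_type(action)
            item = {
                "priority": row.get("priority", ""),
                "action": action,
                "type": todo_type,
                "details": row.get("details", ""),
                "safe": todo_type in SAFE_TYPES,
            }
            (todos if item["safe"] else skipped).append(item)
    return {"todos": todos, "skipped": skipped,
            "safe_todos": len(todos), "skipped_todos": len(skipped)}


if __name__ == "__main__":
    sample = """
| Priority | Action | Type | Details | Status |
|---|---|---|---|---|
| P2 | Add NFS alert | Alert | ... | TODO |
| P1 | Migrate Vault | Migration | ... | TODO |
"""
    print(json.dumps(parse_todos(sample), indent=2))
```

Keying each row by its header name is what lets one function handle both column layouts; the legacy format simply has no `type` key, which triggers the inference path.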
3. Woodpecker Pipeline
Location: .woodpecker/postmortem-todos.yml
Trigger: Push to master with changes in docs/post-mortems/*.md
Steps:
- parse-and-implement: Runs `scripts/postmortem-pipeline.sh`, which:
  - Scans all post-mortems for pending TODOs (no git diff, which avoids shallow clone issues)
  - Parses safe TODOs via the parser script
  - Authenticates to Vault via K8s Service Account JWT
  - Fetches DevVM SSH key from `secret/ci/infra` → `devvm_ssh_key`
  - SSHes to DevVM (10.0.10.10) and runs Claude Code headless
- notify-slack: Posts pipeline result to Slack
Authentication chain: Woodpecker pod → K8s SA token → Vault K8s auth (role: ci) → secret/data/ci/infra → SSH key → DevVM
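The first half of this chain can be sketched with Vault's HTTP API. This is an illustration under assumptions, not the pipeline script itself: the Vault address is a placeholder, the auth method is assumed to be mounted at the default `kubernetes` path, and the helper names are hypothetical. The role `ci`, the path `secret/data/ci/infra`, and the key `devvm_ssh_key` come from the chain above.

```python
import json
import urllib.request

# Assumed values: the Vault address is a placeholder; the token path is the
# standard location of a pod's mounted service account token.
VAULT_ADDR = "https://vault.example.internal"
SA_TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"


def login_request(jwt: str, role: str = "ci") -> urllib.request.Request:
    """Build the Vault Kubernetes-auth login call for the given SA token."""
    body = json.dumps({"role": role, "jwt": jwt}).encode()
    return urllib.request.Request(
        f"{VAULT_ADDR}/v1/auth/kubernetes/login",  # default mount path assumed
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def secret_request(client_token: str) -> urllib.request.Request:
    """Build the KV v2 read for secret/ci/infra (API path secret/data/ci/infra)."""
    return urllib.request.Request(
        f"{VAULT_ADDR}/v1/secret/data/ci/infra",
        headers={"X-Vault-Token": client_token},
    )


def fetch_ssh_key() -> str:
    """Run the chain: SA token -> Vault login -> read devvm_ssh_key."""
    with open(SA_TOKEN_PATH) as fh:
        jwt = fh.read().strip()
    with urllib.request.urlopen(login_request(jwt)) as resp:
        client_token = json.load(resp)["auth"]["client_token"]
    with urllib.request.urlopen(secret_request(client_token)) as resp:
        # KV v2 nests the stored key/value pairs under data.data.
        return json.load(resp)["data"]["data"]["devvm_ssh_key"]
```

The `data/` segment in the API path is how KV v2 exposes reads; it refers to the same secret that is written elsewhere in this document as `secret/ci/infra`.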
4. TODO Resolver Agent
Location: .claude/agents/postmortem-todo-resolver.md
Claude Code agent that runs in headless mode (claude -p --agent postmortem-todo-resolver).
What it does per TODO (in priority order P0 → P3):
- Reads relevant Terraform files
- Implements the change (edit `.tf`, `.tpl`, etc.)
- Runs `scripts/tg plan`; aborts if any resources would be destroyed
- Runs `scripts/tg apply --non-interactive`
- Commits with: `fix(post-mortem): <action> [PM-YYYY-MM-DD]`
After all TODOs:
- Updates the Prevention Plan table: `TODO` → `Done`
- Populates the Follow-up Implementation table:
| Date | Action | Priority | Type | Commit | Implemented By |
|---|---|---|---|---|---|
| 2026-04-14 | Add NFS RPC retransmission alert | P2 | Alert | abc1234 | postmortem-todo-resolver |
| — | Migrate Vault to encrypted PVC | P1 | Migration | — | Needs human review |
Safety guardrails:
- Only implements Alert, Config, Monitor types
- Never modifies platform stacks (vault, dbaas, traefik, authentik)
- Aborts if Terraform plan shows any destroys
- Budget cap: $5 per run
- Skipped items marked as "Needs human review"
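One way to enforce the "no destroys" guardrail is to inspect the machine-readable plan. The sketch below assumes the plan has been exported as JSON (for example with `terraform show -json` on a saved plan file); how `scripts/tg` actually surfaces the plan is not documented here, and both function names are hypothetical.

```python
import json


def destroyed_resources(plan: dict) -> list:
    """Return addresses of resources whose planned actions include a delete."""
    return [
        change["address"]
        for change in plan.get("resource_changes", [])
        if "delete" in change.get("change", {}).get("actions", [])
    ]


def assert_no_destroys(plan_json: str) -> None:
    """Raise if the plan would destroy (or destroy-and-recreate) any resource."""
    doomed = destroyed_resources(json.loads(plan_json))
    if doomed:
        raise RuntimeError(f"plan destroys resources: {', '.join(doomed)}")
```

Checking for `"delete"` anywhere in the action list also catches replacements, which Terraform reports as a delete paired with a create.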
5. Cluster Health Auto-Suggest
Location: .claude/skills/cluster-health/SKILL.md
After running a healthcheck, if the cluster recovered from a previous unhealthy state, the skill suggests:
The cluster has recovered. Would you like me to write a post-mortem? Run `/post-mortem` to generate one.
Secrets & Configuration
| Secret | Location | Purpose |
|---|---|---|
| DevVM SSH key | Vault: `secret/ci/infra` → `devvm_ssh_key` | Woodpecker → DevVM SSH access |
| Slack webhook | Woodpecker global secret `slack_webhook` | Pipeline notifications |
| Anthropic API key | `~/.claude/` on DevVM | Claude Code headless mode |
File Inventory
| File | Type | Description |
|---|---|---|
| `.claude/skills/post-mortem/skill.md` | Skill | Post-mortem writer definition |
| `.claude/skills/post-mortem/template.md` | Template | Post-mortem markdown skeleton |
| `.claude/agents/postmortem-todo-resolver.md` | Agent | Headless TODO implementation agent |
| `.woodpecker/postmortem-todos.yml` | Pipeline | Woodpecker CI triggered on post-mortem changes |
| `scripts/postmortem-pipeline.sh` | Script | Pipeline orchestration (parse, auth, SSH, invoke) |
| `scripts/parse-postmortem-todos.sh` | Script | TODO extraction from markdown |
| `docs/post-mortems/` | Directory | All post-mortem documents |
| `docs/post-mortems/index.html` | Static | Post-mortem index page (deployed to GH Pages) |
Commit Conventions
| Pattern | Used by | Example |
|---|---|---|
| `fix(post-mortem): <action> [PM-YYYY-MM-DD]` | TODO resolver agent | `fix(post-mortem): add NFS alert [PM-2026-04-14]` |
| `docs: post-mortem for <date> <title> [ci skip]` | Post-mortem writer skill | `docs: post-mortem for 2026-04-14 NFS outage [ci skip]` |
| `docs: update post-mortem follow-up [PM-YYYY-MM-DD] [ci skip]` | TODO resolver agent | Final update with Follow-up table |
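As an illustration of the first pattern, the sketch below builds a resolver commit message and recovers the incident date from one. The helper names and the regular expression are hypothetical; only the message format itself comes from the table above.

```python
import re
from datetime import date
from typing import Optional

# Matches the resolver's implementation commits, e.g.
# "fix(post-mortem): add NFS alert [PM-2026-04-14]".
FIX_PATTERN = re.compile(
    r"^fix\(post-mortem\): (?P<action>.+) \[PM-(?P<date>\d{4}-\d{2}-\d{2})\]$"
)


def fix_commit_message(action: str, incident: date) -> str:
    """Format an implementation commit message for one TODO."""
    return f"fix(post-mortem): {action} [PM-{incident.isoformat()}]"


def incident_date(message: str) -> Optional[date]:
    """Return the incident date tagged in a fix commit, or None if no match."""
    match = FIX_PATTERN.match(message)
    return date.fromisoformat(match["date"]) if match else None
```

A check like `incident_date` could be used to group commits by post-mortem when filling in the Follow-up Implementation table.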
Limitations
- Woodpecker shallow clone: The pipeline scans all post-mortems for TODOs rather than diffing `HEAD~1` (shallow clone breaks git history)
- Single DevVM: The agent runs on 10.0.10.10; if DevVM is down, the pipeline fails. Could be extended to multiple hosts.
- Anthropic API dependency: Headless Claude Code requires API access. Budget capped at $5 per run.
- No interactive approval: The agent cannot ask for human approval mid-run. Risky items are skipped entirely.