infra/.claude/skills/post-mortem/template.md
Viktor Barzin 460c68e015 feat: add incident management system with user reporting
- Status page (status.viktorbarzin.me): incident cards with SEV badges,
  expandable timelines, postmortem links, user report rendering
- Issue templates on infra repo for user outage reports
- CronJob reads incidents + user-reports from ViktorBarzin/infra
- "Report an Outage" button on status page links to infra repo
- Post-mortem agents restored (4-stage pipeline: triage → investigation
  → historian → report writer) with updated paths and issue linking
- Post-mortem skill/template updated to link reports to GitHub Issues
  and manage postmortem-required/postmortem-done labels
- Labels: incident, sev1-3, user-report, postmortem-required,
  postmortem-done on infra repo

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 20:00:31 +00:00

# Post-Mortem: <TITLE>
| Field | Value |
|-------|-------|
| **Date** | <DATE> |
| **Duration** | <DURATION> |
| **Severity** | <SEV1/SEV2/SEV3> |
| **Affected Services** | <POD_COUNT> pods across <NAMESPACE_COUNT> namespaces |
| **Issue** | [#N](https://github.com/ViktorBarzin/infra/issues/N) |
| **Status** | Draft |
## Summary
<1-2 sentence summary of the incident.>
## Impact
- **User-facing**: <What users experienced>
- **Blast radius**: <How many services/pods/namespaces affected>
- **Duration**: <How long the outage lasted>
- **Data loss**: <None/details>
- **Monitoring gap**: <Any blind spots in alerting>
## Timeline (UTC)
| Time | Event |
|------|-------|
| **HH:MM** | <First sign of trouble> |
| **HH:MM** | <Detection / user report> |
| **HH:MM** | <Investigation begins> |
| **HH:MM** | <Root cause identified> |
| **HH:MM** | <Fix applied> |
| **HH:MM** | <Service restored> |
## Root Cause
<Narrative description of what went wrong and why.>
## Contributing Factors
1. <Factor that made the incident worse or harder to detect>
2. <Factor...>
## Detection Gaps
| Gap | Impact | Fix |
|-----|--------|-----|
| <What wasn't monitored> | <How it delayed detection> | <What to add> |
## Prevention Plan
### P0 — Prevent this exact failure
| Priority | Action | Type | Details | Status |
|----------|--------|------|---------|--------|
| P0 | <action> | Config | <details> | TODO |
### P1 — Reduce blast radius
| Priority | Action | Type | Details | Status |
|----------|--------|------|---------|--------|
| P1 | <action> | Alert | <details> | TODO |
### P2 — Detect faster
| Priority | Action | Type | Details | Status |
|----------|--------|------|---------|--------|
| P2 | <action> | Monitor | <details> | TODO |
### P3 — Improve resilience
| Priority | Action | Type | Details | Status |
|----------|--------|------|---------|--------|
| P3 | <action> | Architecture | <details> | TODO |
## Lessons Learned
1. <Key takeaway>
2. <Key takeaway>
## Follow-up Implementation
_This section is auto-populated by the postmortem-todo-resolver agent._

| Date | Action | Priority | Type | Commit | Implemented By |
|------|--------|----------|------|--------|----------------|