Viktor Barzin 460c68e015 feat: add incident management system with user reporting

- Status page (status.viktorbarzin.me): incident cards with SEV badges,
  expandable timelines, postmortem links, user report rendering
- Issue templates on infra repo for user outage reports
- CronJob reads incidents + user-reports from ViktorBarzin/infra
- "Report an Outage" button on status page links to infra repo
- Post-mortem agents restored (4-stage pipeline: triage → investigation
  → historian → report writer) with updated paths and issue linking
- Post-mortem skill/template updated to link reports to GitHub Issues
  and manage postmortem-required/postmortem-done labels
- Labels: incident, sev1-3, user-report, postmortem-required,
  postmortem-done on infra repo

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-14 20:00:31 +00:00

3.7 KiB

Raw Blame History

Post-Mortem Writer

Generate a structured post-mortem document after an incident mitigation session.

When to use

After /post-mortem command
Auto-suggested when cluster health transitions from UNHEALTHY → HEALTHY

Instructions

Gather context:
- Run .claude/scripts/sev-context.sh to capture current cluster state
- Review the conversation history for: what broke, timeline, root cause, what was fixed
- Check existing post-mortems at docs/post-mortems/ for format reference
Generate the post-mortem:
- Use the template at .claude/skills/post-mortem/template.md
- Fill in all sections from the investigation context
- Critical: In the Prevention Plan tables, set the Type column correctly:
  - Alert — add/modify Prometheus alerting rules (auto-implementable)
  - Config — change Terraform config, NFS options, etc. (auto-implementable)
  - Monitor — add Uptime Kuma monitors (auto-implementable)
  - Architecture — storage migration, stack redesign (human-only)
  - Investigation — needs further research (human-only)
  - Runbook — document a procedure (human-only)
  - Migration — data or service migration (human-only)
- Items already fixed during the session should have Status = Done
- Items not yet done should have Status = TODO
File naming: docs/post-mortems/<YYYY-MM-DD>-<slug>.md
- Slug: lowercase, hyphenated, max 5 words describing the incident
Update index: Add an entry to docs/post-mortems/index.html
- Add a new card in the incidents grid with date, severity tag, title, description

Link to GitHub Issue (if an issue exists for this incident):

Fill in the Issue field in the template metadata table with [#N](https://github.com/ViktorBarzin/infra/issues/N)

Add a comment to the GitHub Issue linking the postmortem:

GITHUB_TOKEN=$(vault kv get -field=github_pat secret/viktor)
curl -s -X POST \
  -H "Authorization: token $GITHUB_TOKEN" \
  -H "Accept: application/vnd.github.v3+json" \
  "https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/comments" \
  -d '{"body": "**Postmortem:** [View postmortem](https://viktorbarzin.github.io/infra/post-mortems/<YYYY-MM-DD>-<slug>)"}'

Add the postmortem-done label and remove postmortem-required:

curl -s -X POST \
  -H "Authorization: token $GITHUB_TOKEN" \
  "https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/labels" \
  -d '{"labels": ["postmortem-done"]}'
curl -s -X DELETE \
  -H "Authorization: token $GITHUB_TOKEN" \
  "https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/labels/postmortem-required"

If no issue exists, create one with labels incident, sev<N>, postmortem-done

Commit and push:
```
git add docs/post-mortems/<file>.md docs/post-mortems/index.html
git commit -m "docs: post-mortem for <date> <title> [ci skip]"
git push origin master
```
- Use [ci skip] to avoid triggering app-stacks pipeline
- NOTE: The postmortem-todos Woodpecker pipeline WILL trigger (it has its own path filter)

Type Reference for Prevention Plan

Type	Auto-implementable?	Examples
Alert	Yes	Add PrometheusRule, modify alert thresholds
Config	Yes	Change Terraform variables, mount options, CronJob schedules
Monitor	Yes	Add Uptime Kuma HTTP/TCP monitor
Architecture	No	Migrate storage class, redesign HA topology
Investigation	No	Research kernel bug, check Proxmox forum
Runbook	No	Document recovery procedure
Migration	No	Move data between storage backends

3.7 KiB Raw Blame History

Post-Mortem Writer

When to use

Instructions

Type Reference for Prevention Plan

3.7 KiB

Raw Blame History