From 30c3450c61acd8922a89afdf4951ac616df24f4d Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Tue, 14 Apr 2026 18:37:37 +0000
Subject: [PATCH] docs: add user-facing issue reporting guide

Adds "Reporting an Issue" section with:
- Where to report (Slack, GitHub, DM)
- What to include (examples of good vs bad reports)
- What happens after reporting (flow diagram)
- Self-service status checks (Uptime Kuma, Grafana, K8s Dashboard)

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 docs/architecture/incident-response.md | 84 ++++++++++++++++++++++++++
 1 file changed, 84 insertions(+)

diff --git a/docs/architecture/incident-response.md b/docs/architecture/incident-response.md
index 2728a5c1..8b1e652f 100644
--- a/docs/architecture/incident-response.md
+++ b/docs/architecture/incident-response.md
@@ -1,5 +1,89 @@
 # Incident Response & Post-Mortem Pipeline
 
+## Reporting an Issue
+
+If something is broken or behaving unexpectedly, here's how to report it:
+
+### Where to report
+
+| Channel | When to use | Response time |
+|---------|-------------|---------------|
+| **Slack #alerts** | Service down, can't access something | Minutes |
+| **GitHub Issue** on [ViktorBarzin/infra](https://github.com/ViktorBarzin/infra/issues) | Non-urgent bugs, feature requests, recurring problems | Hours |
+| **Direct message Viktor** | Emergencies (DNS down, cluster unreachable, data loss risk) | ASAP |
+
+### What to include
+
+A good issue report helps us fix things faster. Include:
+
+1. **What's broken** — which service, URL, or feature
+2. **When it started** — approximate time (timezone!)
+3. **What you see** — error message, screenshot, HTTP status code
+4. **What you expected** — what should have happened
+
+### Examples
+
+**Good report:**
+> Nextcloud at nextcloud.viktorbarzin.me returns 502 Bad Gateway since ~14:00 UTC.
+> Was working fine this morning. Other services (Grafana, Immich) seem fine.
+
+**Also good (minimal):**
+> ha-sofia.viktorbarzin.lan not resolving — getting NXDOMAIN
+
+**Not helpful:**
+> Nothing works
+
+### What happens after you report
+
+```
+You report issue
+        │
+        ▼
+Viktor investigates with Claude Code (cluster-health, logs, diagnostics)
+        │
+        ▼
+Fix applied → service restored
+        │
+        ▼
+Post-mortem auto-generated with /post-mortem
+        │
+        ▼
+Post-mortem pushed to repo
+        │
+        ▼
+Automated pipeline implements follow-up fixes (alerts, monitoring, config)
+        │
+        ▼
+Post-mortem updated with implementation links
+        │
+        ▼
+Published at GitHub Pages for review
+```
+
+You'll be notified in Slack when:
+- Your issue is being investigated
+- The fix is applied
+- The post-mortem is published (with what was done to prevent recurrence)
+
+### Checking service status
+
+- **Uptime dashboard**: [uptime.viktorbarzin.me](https://uptime.viktorbarzin.me) — real-time status of all services
+- **Post-mortems**: [ViktorBarzin/infra post-mortems](https://github.com/ViktorBarzin/infra/tree/master/docs/post-mortems) — past incidents and their fixes
+- **Grafana**: [grafana.viktorbarzin.me](https://grafana.viktorbarzin.me) — metrics and dashboards
+
+### Common self-service checks
+
+Before reporting, you can check:
+
+| Symptom | Quick check |
+|---------|-------------|
+| Service returns 502/503 | Is the pod running? Check [K8s Dashboard](https://dashboard.viktorbarzin.me) |
+| Can't login (SSO) | Try incognito window — might be cached auth |
+| Slow performance | Check if the node is under memory pressure in Grafana |
+| DNS not resolving | Try `nslookup 10.0.20.201` — if that works, it's client DNS cache |
+
+---
+
 ## Overview
 
 Automated incident response pipeline that handles the full lifecycle: detection → mitigation → post-mortem generation → TODO implementation → documentation update. Claude Code agents automate both the post-mortem writing and the follow-up remediation, with human review gates for risky changes.