Viktor Barzin 30c3450c61 docs: add user-facing issue reporting guide

Adds "Reporting an Issue" section with:
- Where to report (Slack, GitHub, DM)
- What to include (examples of good vs bad reports)
- What happens after reporting (flow diagram)
- Self-service status checks (Uptime Kuma, Grafana, K8s Dashboard)

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-14 18:37:37 +00:00

10 KiB


Incident Response & Post-Mortem Pipeline

Reporting an Issue

If something is broken or behaving unexpectedly, here's how to report it:

Where to report

Channel	When to use	Response time
Slack #alerts	Service down, can't access something	Minutes
GitHub Issue on ViktorBarzin/infra	Non-urgent bugs, feature requests, recurring problems	Hours
Direct message Viktor	Emergencies (DNS down, cluster unreachable, data loss risk)	ASAP

What to include

A good issue report helps us fix things faster. Include:

What's broken: which service, URL, or feature
When it started: approximate time, including the timezone
What you see: error message, screenshot, HTTP status code
What you expected: what should have happened

Examples

Good report:

Nextcloud at nextcloud.viktorbarzin.me returns 502 Bad Gateway since ~14:00 UTC. Was working fine this morning. Other services (Grafana, Immich) seem fine.

Also good (minimal):

ha-sofia.viktorbarzin.lan not resolving — getting NXDOMAIN

Not helpful:

Nothing works

What happens after you report

You report issue
    │
    ▼
Viktor investigates with Claude Code (cluster-health, logs, diagnostics)
    │
    ▼
Fix applied → service restored
    │
    ▼
Post-mortem auto-generated with /post-mortem
    │
    ▼
Post-mortem pushed to repo
    │
    ▼
Automated pipeline implements follow-up fixes (alerts, monitoring, config)
    │
    ▼
Post-mortem updated with implementation links
    │
    ▼
Published at GitHub Pages for review

You'll be notified in Slack when:

Your issue is being investigated
The fix is applied
The post-mortem is published (with what was done to prevent recurrence)

Checking service status

Uptime dashboard: uptime.viktorbarzin.me — real-time status of all services
Post-mortems: ViktorBarzin/infra post-mortems — past incidents and their fixes
Grafana: grafana.viktorbarzin.me — metrics and dashboards

Common self-service checks

Before reporting, you can check:

Symptom	Quick check
Service returns 502/503	Check in the K8s Dashboard whether the pod is running
Can't log in (SSO)	Try an incognito window; cached auth may be the cause
Slow performance	Check in Grafana whether the node is under memory pressure
DNS not resolving	Try `nslookup <domain> 10.0.20.201`; if that works, the problem is likely your client's DNS cache
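
For anyone who prefers a script over `nslookup`, the DNS check in the last row can be sketched in Python. This is a minimal illustration, not part of the repository: the helper names (`build_query`, `response_code`, `resolves_on`) are hypothetical, and it assumes the resolver at 10.0.20.201 answers plain UDP queries on port 53.

```python
import socket
import struct

NXDOMAIN = 3  # DNS RCODE meaning "name does not exist"


def build_query(domain: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for an A record with recursion desired."""
    # Header: ID, flags (RD bit set), 1 question, 0 answer/authority/additional.
    header = struct.pack(">HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in domain.rstrip(".").split(".")
    )
    question = labels + b"\x00" + struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question


def response_code(response: bytes) -> int:
    """Return the RCODE, stored in the low 4 bits of the second flags byte."""
    return response[3] & 0x0F


def resolves_on(domain: str, server: str = "10.0.20.201", timeout: float = 2.0) -> bool:
    """True if `server` returns at least one answer record for `domain`."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(domain), (server, 53))
        response, _ = sock.recvfrom(512)
    answer_count = struct.unpack(">H", response[6:8])[0]
    return response_code(response) == 0 and answer_count > 0
```

If `resolves_on("<domain>")` returns True while the same name fails in your browser, the cluster resolver is fine and the stale entry is most likely on the client side.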

Overview

This is an automated incident response pipeline that handles the full lifecycle: detection → mitigation → post-mortem generation → TODO implementation → documentation update. Claude Code agents automate both the post-mortem writing and the follow-up remediation, with human review gates for risky changes.

Architecture Diagram

graph TD
    A[Incident Detected] --> B[Interactive Mitigation]
    B --> C{Cluster Healthy?}
    C -->|No| B
    C -->|Yes| D[post-mortem skill]
    D --> E[git push post-mortem]
    E --> F[GitHub Webhook]
    F --> G[Woodpecker Pipeline]
    G --> H[Parse safe TODOs]
    H --> I{Safe TODOs?}
    I -->|None| J[Slack: nothing to do]
    I -->|Found| K[Vault Auth via K8s SA]
    K --> L[Fetch SSH Key]
    L --> M[SSH to DevVM]
    M --> N[Claude Code Headless Agent]
    N --> O[Terraform plan + apply]
    O --> P[Update Post-Mortem]
    P --> Q[git push]
    Q --> R[GHA: GitHub Pages]
    Q --> S[Slack Notification]

    style B fill:#6366f1
    style D fill:#6366f1
    style G fill:#4c9e47
    style N fill:#6366f1
    style R fill:#2088ff

Components

1. Post-Mortem Writer Skill

Location: .claude/skills/post-mortem/

File	Purpose
`skill.md`	Skill definition — triggered by `/post-mortem` command
`template.md`	Standard post-mortem markdown template

When to use: After mitigating an incident. Auto-suggested when cluster health transitions UNHEALTHY → HEALTHY.

What it generates:

Standard fields (date, duration, severity, affected services)
Timeline from investigation session
Root cause chain
Prevention Plan with TODO table (Priority, Action, Type, Details, Status)
Lessons learned
Follow-up Implementation table (auto-populated by agent)

Type column is critical for automation:

Type	Auto-implementable?	Examples
`Alert`	Yes	PrometheusRule, alert thresholds
`Config`	Yes	Terraform config, NFS options
`Monitor`	Yes	Uptime Kuma HTTP/TCP monitor
`Architecture`	No — human review	Storage migration, HA redesign
`Investigation`	No — human review	Research, root cause analysis
`Migration`	No — human review	Data or service migration
`Runbook`	No — human review	Document recovery procedure

2. TODO Parser

Location: scripts/parse-postmortem-todos.sh

Shell script (POSIX sh + python3) that:

Scans a post-mortem markdown file for TODO items in Prevention Plan tables
Classifies each TODO as safe (Alert/Config/Monitor) or unsafe
Outputs structured JSON:

{
  "file": "docs/post-mortems/2026-04-14-example.md",
  "todos": [{"priority": "P2", "action": "Add NFS alert", "type": "Alert", "details": "...", "safe": true}],
  "skipped": [{"priority": "P1", "action": "Migrate Vault", "type": "Migration", "details": "...", "safe": false}],
  "safe_todos": 3,
  "skipped_todos": 2
}
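
The classification step can be sketched in Python as follows. This is an illustrative sketch rather than the actual contents of scripts/parse-postmortem-todos.sh. It assumes the Prevention Plan is a markdown pipe table under a heading containing "Prevention Plan", with the columns Priority, Action, Type, Details, Status, and that pending rows carry the status `TODO`. The function name `parse_todos` is hypothetical.

```python
SAFE_TYPES = {"Alert", "Config", "Monitor"}
COLUMNS = ["priority", "action", "type", "details", "status"]


def parse_todos(path: str, markdown: str) -> dict:
    """Extract pending Prevention Plan rows and split them into safe/skipped."""
    todos, skipped = [], []
    in_plan = False
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            # Assumption: the table sits under a "Prevention Plan" heading.
            in_plan = "prevention plan" in stripped.lower()
            continue
        if not (in_plan and stripped.startswith("|")):
            continue
        cells = [cell.strip() for cell in stripped.strip("|").split("|")]
        if len(cells) != len(COLUMNS):
            continue
        row = dict(zip(COLUMNS, cells))
        if row["status"].upper() != "TODO":
            continue  # skips the header row, the separator row and finished items
        todo_type = row["type"].strip("`")
        item = {
            "priority": row["priority"],
            "action": row["action"],
            "type": todo_type,
            "details": row["details"],
            "safe": todo_type in SAFE_TYPES,
        }
        (todos if item["safe"] else skipped).append(item)
    return {
        "file": path,
        "todos": todos,
        "skipped": skipped,
        "safe_todos": len(todos),
        "skipped_todos": len(skipped),
    }
```

Passing the result through `json.dumps` yields the same shape as the example above. Keeping the safe list as an explicit allow-list means any new or misspelled Type value falls into `skipped` by default.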

3. Woodpecker Pipeline

Location: .woodpecker/postmortem-todos.yml

Trigger: Push to master with changes in docs/post-mortems/*.md

Steps:

parse-and-implement: Runs scripts/postmortem-pipeline.sh which:
- Scans all post-mortems for pending TODOs (no git diff — avoids shallow clone issues)
- Parses safe TODOs via the parser script
- Authenticates to Vault via K8s Service Account JWT
- Fetches DevVM SSH key from secret/ci/infra → devvm_ssh_key
- SSHes to DevVM (10.0.10.10) and runs Claude Code headless
notify-slack: Posts pipeline result to Slack

Authentication chain: Woodpecker pod → K8s SA token → Vault K8s auth (role: ci) → secret/data/ci/infra (the KV v2 API path for secret/ci/infra) → SSH key → DevVM
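
The Vault part of this chain can be sketched in Python with the standard library only. This is a hedged illustration, not the code in scripts/postmortem-pipeline.sh. It assumes the Kubernetes auth method is mounted at the default `auth/kubernetes` path, that `secret` is a KV v2 mount, and that the pod has the usual service account token file. The function names and the `vault_addr` parameter are hypothetical.

```python
import json
import urllib.request

# Standard location of the projected service account token inside a pod.
SA_TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"


def login_request(vault_addr: str, jwt: str, role: str = "ci") -> urllib.request.Request:
    """Build the Kubernetes auth login request (assumes the default mount path)."""
    body = json.dumps({"role": role, "jwt": jwt}).encode()
    return urllib.request.Request(
        f"{vault_addr}/v1/auth/kubernetes/login",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def secret_request(vault_addr: str, token: str, path: str = "secret/data/ci/infra") -> urllib.request.Request:
    """Build the KV v2 read request, authenticated with the Vault client token."""
    return urllib.request.Request(
        f"{vault_addr}/v1/{path}",
        headers={"X-Vault-Token": token},
    )


def extract_token(login_response: dict) -> str:
    return login_response["auth"]["client_token"]


def extract_ssh_key(secret_response: dict, field: str = "devvm_ssh_key") -> str:
    # KV v2 nests the stored fields under data.data.
    return secret_response["data"]["data"][field]


def fetch_ssh_key(vault_addr: str) -> str:
    """Run the full chain: SA token -> Vault login -> read the SSH key."""
    with open(SA_TOKEN_PATH) as token_file:
        jwt = token_file.read().strip()
    with urllib.request.urlopen(login_request(vault_addr, jwt)) as resp:
        token = extract_token(json.load(resp))
    with urllib.request.urlopen(secret_request(vault_addr, token)) as resp:
        return extract_ssh_key(json.load(resp))
```

Splitting request construction from the network calls keeps each hop of the chain easy to inspect without a live Vault.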

4. TODO Resolver Agent

Location: .claude/agents/postmortem-todo-resolver.md

Claude Code agent that runs in headless mode (claude -p --agent postmortem-todo-resolver).

What it does per TODO (in priority order P0 → P3):

Reads relevant Terraform files
Implements the change (edit .tf, .tpl, etc.)
Runs scripts/tg plan — aborts if any resources would be destroyed
Runs scripts/tg apply --non-interactive
Commits with: fix(post-mortem): <action> [PM-YYYY-MM-DD]
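
The "abort on destroy" check in the plan step can be sketched as follows. This is an illustration only, not the agent's actual logic. It assumes `scripts/tg plan` passes through Terraform's standard summary line (`Plan: N to add, N to change, N to destroy.`) without color codes. The function names are hypothetical.

```python
import re

# Matches the destroy count on a Terraform plan summary line, for example
# "Plan: 2 to add, 1 to change, 0 to destroy."
PLAN_DESTROYS = re.compile(r"Plan:[^\n]*?(\d+) to destroy")


def destroy_count(plan_output: str) -> int:
    """Sum the 'to destroy' counts over every plan summary in the output."""
    return sum(int(match.group(1)) for match in PLAN_DESTROYS.finditer(plan_output))


def safe_to_apply(plan_output: str) -> bool:
    """True when no resource would be destroyed.

    A failed plan prints no summary line, so the caller should also check
    the exit code of the plan command before trusting this result.
    """
    return destroy_count(plan_output) == 0
```

Because Terraform reports a resource replacement as one add plus one destroy, this check also stops changes that would recreate a resource.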

After all TODOs:

Updates the Prevention Plan table: TODO → Done
Populates the Follow-up Implementation table:

Date	Action	Priority	Type	Commit	Implemented By
2026-04-14	Add NFS RPC retransmission alert	P2	Alert	`abc1234`	postmortem-todo-resolver
—	Migrate Vault to encrypted PVC	P1	Migration	—	Needs human review

Safety guardrails:

Only implements Alert, Config, Monitor types
Never modifies platform stacks (vault, dbaas, traefik, authentik)
Aborts if Terraform plan shows any destroys
Budget cap: $5 per run
Skipped items marked as "Needs human review"

5. Cluster Health Auto-Suggest

Location: .claude/skills/cluster-health/SKILL.md

After running a healthcheck, if the cluster recovered from a previous unhealthy state, the skill suggests:

The cluster has recovered. Would you like me to write a post-mortem? Run /post-mortem to generate one.

Secrets & Configuration

Secret	Location	Purpose
DevVM SSH key	`secret/ci/infra` → `devvm_ssh_key`	Woodpecker → DevVM SSH access
Slack webhook	Woodpecker global secret `slack_webhook`	Pipeline notifications
Anthropic API key	`~/.claude/` on DevVM	Claude Code headless mode

File Inventory

File	Type	Description
`.claude/skills/post-mortem/skill.md`	Skill	Post-mortem writer definition
`.claude/skills/post-mortem/template.md`	Template	Post-mortem markdown skeleton
`.claude/agents/postmortem-todo-resolver.md`	Agent	Headless TODO implementation agent
`.woodpecker/postmortem-todos.yml`	Pipeline	Woodpecker CI triggered on post-mortem changes
`scripts/postmortem-pipeline.sh`	Script	Pipeline orchestration (parse, auth, SSH, invoke)
`scripts/parse-postmortem-todos.sh`	Script	TODO extraction from markdown
`docs/post-mortems/`	Directory	All post-mortem documents
`docs/post-mortems/index.html`	Static	Post-mortem index page (deployed to GH Pages)

Commit Conventions

Pattern	Used by	Example
`fix(post-mortem): <action> [PM-YYYY-MM-DD]`	TODO resolver agent	`fix(post-mortem): add NFS alert [PM-2026-04-14]`
`docs: post-mortem for <date> <title> [ci skip]`	Post-mortem writer skill	`docs: post-mortem for 2026-04-14 NFS outage [ci skip]`
`docs: update post-mortem follow-up [PM-YYYY-MM-DD] [ci skip]`	TODO resolver agent	Final update with Follow-up table
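
These patterns are regular enough to build and parse mechanically. The sketch below is a hypothetical helper, not code from the repository: it builds the resolver agent's commit subject, recovers the incident date from the `[PM-YYYY-MM-DD]` tag, and detects the `[ci skip]` marker.

```python
import re
from datetime import date
from typing import Optional

PM_TAG = re.compile(r"\[PM-(\d{4}-\d{2}-\d{2})\]")


def fix_commit_subject(action: str, incident: date) -> str:
    """Build a subject in the `fix(post-mortem): <action> [PM-YYYY-MM-DD]` form."""
    return f"fix(post-mortem): {action} [PM-{incident.isoformat()}]"


def incident_date(subject: str) -> Optional[date]:
    """Return the incident date from the PM tag, or None if the tag is absent."""
    match = PM_TAG.search(subject)
    return date.fromisoformat(match.group(1)) if match else None


def skips_ci(subject: str) -> bool:
    """True if the subject carries the `[ci skip]` marker."""
    return "[ci skip]" in subject
```

For example, `fix_commit_subject("add NFS alert", date(2026, 4, 14))` produces the subject shown in the first row of the table.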

Limitations

Woodpecker shallow clone: The pipeline scans all post-mortems for pending TODOs instead of diffing against HEAD~1, because a shallow clone may not contain the parent commit.
Single DevVM: The agent runs on 10.0.10.10, so the pipeline fails if the DevVM is down. It could be extended to multiple hosts.
Anthropic API dependency: Headless Claude Code requires API access. Budget capped at $5 per run.
No interactive approval: The agent cannot ask for human approval mid-run, so risky items are skipped entirely.
