infra/docs/architecture/incident-response.md
Viktor Barzin 42f1c3cf4f [claude-agent-service] Migrate all pipelines from DevVM SSH to K8s HTTP
## Context

The claude-agent-service K8s pod (deployed 2026-04-15) provides an HTTP API
for running Claude headless agents. Three workflows still SSH'd to the DevVM
(10.0.10.10) to invoke `claude -p`. This eliminates that dependency.

## This change

Pipeline migrations (SSH → HTTP POST to claude-agent-service):
- `.woodpecker/issue-automation.yml` — Vault auth fetches API token instead
  of SSH key; curl POST /execute + poll /jobs/{id} replaces SSH invocation
- `scripts/postmortem-pipeline.sh` — same pattern; uses jq for safe JSON
  construction of TODO payloads
- `.woodpecker/postmortem-todos.yml` — drop openssh-client from apk install
- `stacks/n8n/workflows/diun-upgrade.json` — SSH node replaced with HTTP
  Request node; API token via $env.CLAUDE_AGENT_API_TOKEN (added to Vault
  secret/n8n)

Documentation updates:
- `docs/architecture/incident-response.md` — Mermaid diagram: DevVM → K8s
- `docs/architecture/automated-upgrades.md` — pipeline diagram + n8n action
- `AGENTS.md` — pipeline description updated

## What is NOT in this change

- DevVM decommissioning (still hosts terminal/foolery services)
- Removal of SSH key secrets from Vault (kept for rollback)
- n8n workflow import (must be done manually in n8n UI)

[ci skip]

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
2026-04-18 10:12:02 +00:00


Contributing to the Infrastructure

Welcome! This doc explains how to report issues, request features, and what happens behind the scenes.

| What | Where |
|------|-------|
| Report an outage | File an issue |
| Request a feature | File a request |
| Check service status | status.viktorbarzin.me |
| View past incidents | Post-mortems |
| Uptime dashboard | uptime.viktorbarzin.me |
| Grafana dashboards | grafana.viktorbarzin.me |

Reporting an Outage

If something is broken, file an outage report. The form asks for:

  • Which service is affected (dropdown)
  • What you see (error message, behavior)
  • What kind of error (502, timeout, auth, slow, etc.)
  • When it started
  • Is it just you or others too?

What makes a good report

Good:

Nextcloud at nextcloud.viktorbarzin.me returns 502 Bad Gateway since ~14:00 UTC. Other services seem fine. Tried incognito — same result.

Also good (minimal):

Home Assistant not loading since this morning

Not helpful:

Nothing works

What happens after you report

flowchart TD
    A["You file a GitHub Issue<br/>(outage-report template)"] --> B["GitHub Actions triggers<br/>(within seconds)"]
    B --> C{Are you a<br/>collaborator?}
    C -->|No| D["'Queued for review'<br/>comment added"]
    D --> E["Viktor reviews manually"]
    C -->|Yes| F["Automated agent<br/>starts investigating"]
    F --> G{Is the service<br/>actually down?}
    G -->|"Healthy"| H["Agent posts findings<br/>+ closes issue"]
    G -->|"Down"| I["Agent classifies severity<br/>(SEV1 / SEV2 / SEV3)"]
    I --> J{Can the agent<br/>fix it?}
    J -->|"Yes (confident)"| K["Agent applies fix<br/>+ posts resolution"]
    J -->|"No (complex)"| L["Agent escalates<br/>to Viktor"]
    K --> M["Post-mortem written<br/>+ published"]
    L --> N["Viktor investigates<br/>+ fixes manually"]
    N --> M
    M --> O["Status page updated<br/>at status.viktorbarzin.me"]

    style A fill:#6366f1,color:#fff
    style F fill:#22c55e,color:#fff
    style K fill:#22c55e,color:#fff
    style L fill:#f59e0b,color:#000
    style M fill:#3b82f6,color:#fff
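
The decision points in the flowchart above can be read as a small routing function. The following is only an illustrative sketch: the `OutageReport` type, its field names, and the step strings are hypothetical and are not taken from the actual issue-responder agent. It follows the flowchart literally, so every confirmed outage ends with a post-mortem and a status page update.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OutageReport:
    """Hypothetical summary of what is known about a filed outage issue."""
    is_collaborator: bool
    service_healthy: bool            # outcome of the agent's health check
    severity: Optional[str] = None   # "SEV1" / "SEV2" / "SEV3" once classified
    agent_confident: bool = False    # whether the agent believes it can fix it


def route_outage_report(report: OutageReport) -> list[str]:
    """Return the ordered steps the flowchart takes for this report."""
    if not report.is_collaborator:
        return ["comment: queued for review", "Viktor reviews manually"]

    steps = ["agent investigates"]
    if report.service_healthy:
        return steps + ["agent posts findings", "close issue"]

    steps.append(f"classify severity: {report.severity}")
    if report.agent_confident:
        steps.append("agent applies fix and posts resolution")
    else:
        steps += ["agent escalates to Viktor", "Viktor fixes manually"]
    return steps + ["post-mortem written and published", "status page updated"]
```

For example, `route_outage_report(OutageReport(True, False, "SEV2", True))` walks the automated-fix branch, while a report from a non-collaborator stops at manual review.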

What to expect

| Scenario | Response time | Who handles it |
|----------|---------------|----------------|
| Service is actually healthy | ~5 minutes | Automated agent checks and closes |
| Simple fix (pod restart, config) | ~10 minutes | Automated agent fixes and reports |
| Complex issue (data, architecture) | ~30 min to acknowledge | Agent investigates, escalates to Viktor |
| Non-collaborator report | Hours | Queued for manual review |

After resolution

For SEV1 and SEV2 incidents, a post-mortem is automatically written documenting:

  • What happened and the timeline
  • Root cause analysis
  • What was done to prevent recurrence

Post-mortems are published at viktorbarzin.github.io/infra/post-mortems.


Requesting a Feature

Want a new service deployed, a config change, or a new monitor? File a feature request.

Just describe what you need — be specific.

What happens after you request

flowchart TD
    A["You file a GitHub Issue<br/>(feature-request template)"] --> B["GitHub Actions triggers"]
    B --> C{Are you a<br/>collaborator?}
    C -->|No| D["'Queued for review'<br/>comment added"]
    C -->|Yes| E["Automated agent<br/>assesses the request"]
    E --> F{Is it<br/>straightforward?}
    F -->|"Yes"| G["Agent implements it<br/>(Terraform + apply)"]
    G --> H["Agent comments<br/>what was done"]
    H --> I["Issue closed"]
    F -->|"No (complex)"| J["Agent posts assessment:<br/>what's needed, risks, effort"]
    J --> K["Escalated to Viktor<br/>for review"]

    style A fill:#6366f1,color:#fff
    style G fill:#22c55e,color:#fff
    style K fill:#f59e0b,color:#000

Examples of what the agent can do automatically

  • Add an Uptime Kuma monitor for a service
  • Deploy a known service (Helm chart or standard Terraform stack)
  • Change resource limits, replica counts
  • Add a DNS record
  • Configure an ingress route

Examples of what gets escalated

  • Deploy a completely new/unknown service
  • Architecture changes (HA, storage migration)
  • Changes to core platform (auth, DNS, ingress, databases)
  • Anything involving data migration or secrets
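
The two lists above amount to an allow-list and an escalation list. A minimal sketch of that routing follows, assuming each request has already been tagged with one or more categories. The category names and the `route_feature_request` helper are hypothetical; the real agent makes this judgment from the issue text rather than from fixed tags. Unknown or missing categories are escalated, in line with the "if unsure, escalate" behaviour described in this document.

```python
# Hypothetical category tags mirroring the two bullet lists above.
AUTOMATIC = frozenset({
    "add_monitor",            # Uptime Kuma monitor
    "deploy_known_service",   # Helm chart or standard Terraform stack
    "change_resources",       # resource limits, replica counts
    "add_dns_record",
    "configure_ingress",
})
ESCALATE = frozenset({
    "deploy_unknown_service",
    "architecture_change",    # HA, storage migration
    "core_platform_change",   # auth, DNS, ingress, databases
    "data_migration",
    "secrets",
})


def route_feature_request(is_collaborator: bool, categories: list[str]) -> str:
    """Return 'queued_for_review', 'escalate', or 'implement'."""
    if not is_collaborator:
        return "queued_for_review"
    if not categories:
        return "escalate"  # nothing recognisable: do not guess
    if any(c in ESCALATE for c in categories):
        return "escalate"
    if all(c in AUTOMATIC for c in categories):
        return "implement"
    return "escalate"      # unknown category: treat as complex
```

A request tagged `["add_dns_record", "configure_ingress"]` would be implemented, while adding `"secrets"` to the same request would escalate it.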

Before Reporting — Self-Service Checks

| Symptom | Quick check |
|---------|-------------|
| Service returns 502/503 | Check the status page: is the service shown as down? |
| Can't log in (SSO) | Try an incognito window; it might be a cached auth cookie |
| Slow performance | Check Grafana for node memory/CPU pressure |
| DNS not resolving | Try `nslookup <domain> 10.0.20.201`; if that works, flush your DNS cache |
| VPN not connecting | Check Headscale admin for your device status |

Severity Levels

| Level | Definition | Examples | Response |
|-------|------------|----------|----------|
| SEV1 | Critical: multiple services down, data at risk, core infra outage | DNS down, auth broken, cluster node unreachable | Immediate automated investigation + escalation |
| SEV2 | Major: single important service down or significantly degraded | Nextcloud 502, Immich not loading, mail not sending | Automated investigation, fix if possible |
| SEV3 | Minor: limited impact, workaround available, cosmetic | Slow dashboard, one monitor flapping, non-critical CronJob failed | Noted, fixed when convenient |
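
The definitions above can be approximated as a simple rule cascade. This is a sketch under stated assumptions, not the logic of the sev-triage agent: the input signals (`services_down`, `data_at_risk`, and so on) are hypothetical, and the real classification is made by an agent inspecting the cluster. The `labels_for` helper combines the severity with the `incident` and `postmortem-required` labels this document describes for SEV1 and SEV2.

```python
def classify_severity(
    services_down: int,
    data_at_risk: bool = False,
    core_infra_affected: bool = False,
    important_service_degraded: bool = False,
) -> str:
    """Map coarse incident signals onto SEV1 / SEV2 / SEV3."""
    # SEV1: multiple services down, data at risk, or a core infra outage.
    if services_down > 1 or data_at_risk or core_infra_affected:
        return "SEV1"
    # SEV2: one important service down or significantly degraded.
    if services_down == 1 or important_service_degraded:
        return "SEV2"
    # SEV3: limited impact, workaround available, cosmetic.
    return "SEV3"


def labels_for(severity: str) -> list[str]:
    """Issue labels for a confirmed incident of the given severity."""
    labels = ["incident", severity.lower()]
    if severity in ("SEV1", "SEV2"):
        labels.append("postmortem-required")
    return labels
```

Checking the SEV1 conditions first means a single-service outage that also puts data at risk is never downgraded to SEV2.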

Status Page

The status page at status.viktorbarzin.me shows:

  • Live service status — updated every 5 minutes from Uptime Kuma monitors
  • Active incidents — SEV-classified with timelines and affected services
  • User reports — issues filed by users, with error type and scope
  • Recently resolved — incidents closed in the last 7 days with postmortem links

The status page is hosted on GitHub Pages — it stays up even when the cluster is down.


Architecture (Technical Details)

For contributors who want to understand how the automation works.

End-to-End Flow

flowchart LR
    subgraph GitHub
        A[Issue Created] --> B[GHA Workflow]
        B --> C{Collaborator?}
    end

    subgraph "Kubernetes Cluster"
        C -->|Yes| D[Woodpecker Pipeline]
        D --> E[Vault Auth<br/>K8s SA JWT]
        E --> F[Fetch API Token]
    end

    subgraph "claude-agent-service (K8s)"
        F --> G[HTTP POST /execute]
        G --> H[issue-responder agent]
        H --> I[Investigate / Implement]
        I --> J[Comment on Issue]
        I --> K[Terraform Apply]
        I --> L[Post-Mortem Pipeline]
    end

    subgraph "Post-Mortem Pipeline"
        L --> M[sev-triage<br/>haiku, ~60s]
        M --> N[Specialists<br/>3-5 agents parallel]
        N --> O[sev-historian<br/>cross-ref past incidents]
        O --> P[sev-report-writer<br/>write report + action items]
        P --> Q[postmortem-todo-resolver<br/>implement safe fixes]
    end

    style B fill:#2088ff,color:#fff
    style D fill:#4c9e47,color:#fff
    style H fill:#6366f1,color:#fff
    style Q fill:#6366f1,color:#fff
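
The pipeline's call into claude-agent-service follows a submit-then-poll pattern: POST to `/execute`, then poll `/jobs/{id}`. The sketch below shows that pattern in Python rather than the curl calls the pipeline itself uses. Only the two endpoint paths come from this document. The bearer-token header, the `job_id` and `status` field names, the terminal status values, and the polling interval are all assumptions made for illustration.

```python
import json
import time
import urllib.request
from typing import Callable, Optional


def http_json(method: str, url: str, token: str, body: Optional[dict] = None) -> dict:
    """Send a JSON request with a bearer token and decode the JSON reply."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(url, data=data, method=method)
    req.add_header("Authorization", f"Bearer {token}")  # ASSUMED auth scheme
    req.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def run_agent_job(
    base_url: str,
    token: str,
    payload: dict,
    request: Callable[..., dict] = http_json,
    poll_seconds: float = 5.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Submit a job to /execute and poll /jobs/{id} until it finishes."""
    job = request("POST", f"{base_url}/execute", token, payload)
    job_id = job["job_id"]  # ASSUMED response field name
    for _ in range(max_polls):
        status = request("GET", f"{base_url}/jobs/{job_id}", token)
        if status.get("status") in ("succeeded", "failed"):  # ASSUMED values
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

Passing `request` and `sleep` as parameters keeps the polling loop testable without a running service. Bounding the loop with `max_polls` means a stuck job fails the pipeline instead of hanging it.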

Components

| Component | Location | Purpose |
|-----------|----------|---------|
| GHA Workflow | `.github/workflows/issue-automation.yml` | Triggers on issue creation, checks collaborator, POSTs to Woodpecker |
| Woodpecker Pipeline | `.woodpecker/issue-automation.yml` | Authenticates to Vault, fetches the API token, POSTs to claude-agent-service `/execute` and polls `/jobs/{id}` |
| Issue Responder | `.claude/agents/issue-responder.md` | Reads issue, classifies, investigates, fixes or escalates |
| Post-Mortem Orchestrator | `.claude/agents/post-mortem.md` | 4-stage investigation pipeline |
| SEV Triage | `.claude/agents/sev-triage.md` | Fast cluster scan + severity classification |
| SEV Historian | `.claude/agents/sev-historian.md` | Cross-references past incidents |
| SEV Report Writer | `.claude/agents/sev-report-writer.md` | Writes final postmortem + links to issue |
| TODO Resolver | `.claude/agents/postmortem-todo-resolver.md` | Implements safe follow-up fixes |
| Post-Mortem Skill | `.claude/skills/post-mortem/` | Manual `/post-mortem` command |
| Cluster Health | `.claude/skills/cluster-health/` | Health check with auto-filing for SEV1/SEV2 |
| Status Page CronJob | `stacks/status-page/main.tf` | Pushes status + incidents to GitHub Pages every 5 min |
| Issue Templates | `.github/ISSUE_TEMPLATE/` | Structured forms for outage reports + feature requests |

Safety Guardrails

The automated agent follows strict rules:

  • All changes go through Terraform — never kubectl apply as final state
  • terraform plan before every apply — aborts if any resources would be destroyed
  • Platform stacks are hands-off — vault, dbaas, traefik, authentik, kyverno always escalate
  • No data deletion — never deletes PVCs, PVs, or user data
  • Budget capped — $10 max per issue, $5 per post-mortem run
  • Complex = escalate — if the agent isn't confident, it assigns to Viktor with findings
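
Two of the rules above, the destroy check and the hands-off platform stacks, lend themselves to a mechanical gate. The sketch below assumes the plan has been exported with `terraform show -json` and reads the `resource_changes[].change.actions` field of that JSON. The function names and the three return values are hypothetical, and treating a replacement (`delete` plus `create`) as a destroy is a conservative assumption rather than something this document specifies.

```python
# Stacks listed above as always escalating.
PLATFORM_STACKS = frozenset({"vault", "dbaas", "traefik", "authentik", "kyverno"})


def destroyed_resources(plan: dict) -> list[str]:
    """Addresses of resources whose planned actions include a delete."""
    return [
        rc["address"]
        for rc in plan.get("resource_changes", [])
        if "delete" in rc.get("change", {}).get("actions", [])
    ]


def guardrail_decision(stack: str, plan: dict) -> str:
    """Return 'escalate', 'abort', or 'apply' for a planned change."""
    if stack in PLATFORM_STACKS:
        return "escalate"   # platform stacks are hands-off
    if destroyed_resources(plan):
        return "abort"      # never apply a plan that destroys resources
    return "apply"
```

With `plan = json.load(open("plan.json"))`, a stack such as `nextcloud` whose plan only updates a deployment would return `"apply"`, while any plan for `vault` returns `"escalate"` regardless of its contents.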

Labels

| Label | Purpose |
|-------|---------|
| `user-report` | Auto-applied to outage reports |
| `feature-request` | Auto-applied to feature requests |
| `incident` | Confirmed incident (appears on status page) |
| `sev1` / `sev2` / `sev3` | Severity classification |
| `postmortem-required` | SEV needs a postmortem |
| `postmortem-done` | Postmortem written and linked |
| `needs-human` | Agent escalated; needs Viktor's attention |

Commit Conventions

| Pattern | Used by |
|---------|---------|
| `feat: <desc> (fixes #N)` | Issue responder (feature implementations) |
| `fix: <desc> (fixes #N)` | Issue responder (incident fixes) |
| `fix(post-mortem): <action> [PM-YYYY-MM-DD]` | Post-mortem TODO resolver |
| `docs: post-mortem for <date> <title> [ci skip]` | Post-mortem writer |
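
These conventions are regular enough to check with a few regular expressions, for example to tell which agent produced a given commit. The following is a sketch only: the `commit_source` helper and the source names are hypothetical, and the expressions assume that `<date>` contains no spaces, which this document does not state.

```python
import re
from typing import Optional

# One pattern per row of the conventions above.
COMMIT_PATTERNS = {
    "issue-responder-feature": re.compile(r"^feat: .+ \(fixes #\d+\)$"),
    "issue-responder-fix": re.compile(r"^fix: .+ \(fixes #\d+\)$"),
    "todo-resolver": re.compile(r"^fix\(post-mortem\): .+ \[PM-\d{4}-\d{2}-\d{2}\]$"),
    "post-mortem-writer": re.compile(r"^docs: post-mortem for \S+ .+ \[ci skip\]$"),
}


def commit_source(subject: str) -> Optional[str]:
    """Return which automation a commit subject matches, or None."""
    for name, pattern in COMMIT_PATTERNS.items():
        if pattern.match(subject):
            return name
    return None
```

A subject such as `fix(post-mortem): raise memory limit [PM-2026-04-18]` maps to the TODO resolver, while a hand-written commit that matches none of the patterns returns `None`.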