## Context
The claude-agent-service K8s pod (deployed 2026-04-15) provides an HTTP API
for running Claude headless agents. Three workflows still SSH'd to the DevVM
(10.0.10.10) to invoke `claude -p`. This eliminates that dependency.
## This change
Pipeline migrations (SSH → HTTP POST to claude-agent-service):
- `.woodpecker/issue-automation.yml` — Vault auth fetches API token instead
of SSH key; curl POST /execute + poll /jobs/{id} replaces SSH invocation
- `scripts/postmortem-pipeline.sh` — same pattern; uses jq for safe JSON
construction of TODO payloads
- `.woodpecker/postmortem-todos.yml` — drop openssh-client from apk install
- `stacks/n8n/workflows/diun-upgrade.json` — SSH node replaced with HTTP
Request node; API token via $env.CLAUDE_AGENT_API_TOKEN (added to Vault
secret/n8n)
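
The POST-and-poll pattern referenced above, sketched in shell. Only the `/execute` and `/jobs/{id}` paths come from this change; the service hostname, auth header, payload fields, and response fields are assumptions:

```sh
# Submit the prompt to claude-agent-service instead of SSHing to the DevVM.
API="http://claude-agent-service.claude-agent.svc.cluster.local:8080"  # hostname/port assumed

job_id=$(curl -sf -X POST "$API/execute" \
  -H "Authorization: Bearer $CLAUDE_AGENT_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d "$(jq -n --arg prompt "$PROMPT" '{prompt: $prompt}')" | jq -r '.id')

# Poll the job until it leaves the running state
# ("id", "status", and "running" are assumed field names/values).
while [ "$(curl -sf -H "Authorization: Bearer $CLAUDE_AGENT_API_TOKEN" \
    "$API/jobs/$job_id" | jq -r '.status')" = "running" ]; do
  sleep 10
done
```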
Documentation updates:
- `docs/architecture/incident-response.md` — Mermaid diagram: DevVM → K8s
- `docs/architecture/automated-upgrades.md` — pipeline diagram + n8n action
- `AGENTS.md` — pipeline description updated
## What is NOT in this change
- DevVM decommissioning (still hosts terminal/foolery services)
- Removal of SSH key secrets from Vault (kept for rollback)
- n8n workflow import (must be done manually in n8n UI)
[ci skip]
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
# Contributing to the Infrastructure
Welcome! This doc explains how to report issues, request features, and what happens behind the scenes.
## Quick Links
| What | Where |
|---|---|
| Report an outage | File an issue |
| Request a feature | File a request |
| Check service status | status.viktorbarzin.me |
| View past incidents | Post-mortems |
| Uptime dashboard | uptime.viktorbarzin.me |
| Grafana dashboards | grafana.viktorbarzin.me |
## Reporting an Outage
If something is broken, file an outage report. The form asks for:
- Which service is affected (dropdown)
- What you see (error message, behavior)
- What kind of error (502, timeout, auth, slow, etc.)
- When it started
- Is it just you or others too?
### What makes a good report
Good:

> Nextcloud at nextcloud.viktorbarzin.me returns 502 Bad Gateway since ~14:00 UTC. Other services seem fine. Tried incognito — same result.

Also good (minimal):

> Home Assistant not loading since this morning

Not helpful:

> Nothing works
### What happens after you report

```mermaid
flowchart TD
A["You file a GitHub Issue<br/>(outage-report template)"] --> B["GitHub Actions triggers<br/>(within seconds)"]
B --> C{Are you a<br/>collaborator?}
C -->|No| D["'Queued for review'<br/>comment added"]
D --> E["Viktor reviews manually"]
C -->|Yes| F["Automated agent<br/>starts investigating"]
F --> G{Is the service<br/>actually down?}
G -->|"Healthy"| H["Agent posts findings<br/>+ closes issue"]
G -->|"Down"| I["Agent classifies severity<br/>(SEV1 / SEV2 / SEV3)"]
I --> J{Can the agent<br/>fix it?}
J -->|"Yes (confident)"| K["Agent applies fix<br/>+ posts resolution"]
J -->|"No (complex)"| L["Agent escalates<br/>to Viktor"]
K --> M["Post-mortem written<br/>+ published"]
L --> N["Viktor investigates<br/>+ fixes manually"]
N --> M
M --> O["Status page updated<br/>at status.viktorbarzin.me"]
style A fill:#6366f1,color:#fff
style F fill:#22c55e,color:#fff
style K fill:#22c55e,color:#fff
style L fill:#f59e0b,color:#000
style M fill:#3b82f6,color:#fff
```
### What to expect
| Scenario | Response time | Who handles it |
|---|---|---|
| Service is actually healthy | ~5 minutes | Automated agent checks and closes |
| Simple fix (pod restart, config) | ~10 minutes | Automated agent fixes and reports |
| Complex issue (data, architecture) | ~30 min to acknowledge | Agent investigates, escalates to Viktor |
| Non-collaborator report | Hours | Queued for manual review |
### After resolution
For SEV1 and SEV2 incidents, a post-mortem is automatically written documenting:
- What happened and the timeline
- Root cause analysis
- What was done to prevent recurrence
Post-mortems are published at viktorbarzin.github.io/infra/post-mortems.
## Requesting a Feature
Want a new service deployed, a config change, or a new monitor? File a feature request.
Just describe what you need — be specific.
### What happens after you request

```mermaid
flowchart TD
A["You file a GitHub Issue<br/>(feature-request template)"] --> B["GitHub Actions triggers"]
B --> C{Are you a<br/>collaborator?}
C -->|No| D["'Queued for review'<br/>comment added"]
C -->|Yes| E["Automated agent<br/>assesses the request"]
E --> F{Is it<br/>straightforward?}
F -->|"Yes"| G["Agent implements it<br/>(Terraform + apply)"]
G --> H["Agent comments<br/>what was done"]
H --> I["Issue closed"]
F -->|"No (complex)"| J["Agent posts assessment:<br/>what's needed, risks, effort"]
J --> K["Escalated to Viktor<br/>for review"]
style A fill:#6366f1,color:#fff
style G fill:#22c55e,color:#fff
style K fill:#f59e0b,color:#000
```
### Examples of what the agent can do automatically
- Add an Uptime Kuma monitor for a service
- Deploy a known service (Helm chart or standard Terraform stack)
- Change resource limits, replica counts
- Add a DNS record
- Configure an ingress route
### Examples of what gets escalated
- Deploy a completely new/unknown service
- Architecture changes (HA, storage migration)
- Changes to core platform (auth, DNS, ingress, databases)
- Anything involving data migration or secrets
## Before Reporting — Self-Service Checks
| Symptom | Quick check |
|---|---|
| Service returns 502/503 | Check status page — is the service shown as down? |
| Can't login (SSO) | Try incognito window — might be cached auth cookie |
| Slow performance | Check Grafana for node memory/CPU pressure |
| DNS not resolving | Try `nslookup <domain> 10.0.20.201` — if that works, flush your DNS cache |
| VPN not connecting | Check Headscale admin for your device status |
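
The same checks from the command line, roughly (the service hostname is just an example; the DNS server IP is the one from the table):

```sh
# Is the service itself answering, and with which status code?
curl -sI https://nextcloud.viktorbarzin.me | head -n 1

# Query the cluster resolver directly; if this resolves but your browser
# can't reach the service, flush your local DNS cache.
nslookup nextcloud.viktorbarzin.me 10.0.20.201
```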
## Severity Levels
| Level | Definition | Examples | Response |
|---|---|---|---|
| SEV1 | Critical — multiple services down, data at risk, core infra outage | DNS down, auth broken, cluster node unreachable | Immediate automated investigation + escalation |
| SEV2 | Major — single important service down or significantly degraded | Nextcloud 502, Immich not loading, mail not sending | Automated investigation, fix if possible |
| SEV3 | Minor — limited impact, workaround available, cosmetic | Slow dashboard, one monitor flapping, non-critical CronJob failed | Noted, fixed when convenient |
## Status Page
The status page at status.viktorbarzin.me shows:
- Live service status — updated every 5 minutes from Uptime Kuma monitors
- Active incidents — SEV-classified with timelines and affected services
- User reports — issues filed by users, with error type and scope
- Recently resolved — incidents closed in the last 7 days with postmortem links
The status page is hosted on GitHub Pages — it stays up even when the cluster is down.
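
The publish step behind this is, in rough terms, a scheduled commit to the Pages branch. A minimal sketch, assuming the repo layout, branch name, and file names (the status JSON itself comes from the Uptime Kuma monitors):

```sh
# Sketch of the status-page CronJob's publish step; branch and file names are assumptions.
git clone --depth 1 --branch gh-pages https://github.com/viktorbarzin/infra.git site
cp status.json incidents.json site/
cd site
git add status.json incidents.json
git -c user.name=status-bot -c user.email=status-bot@viktorbarzin.me \
  commit -m "status: update $(date -u +%FT%TZ) [ci skip]" && git push
```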
## Architecture (Technical Details)
For contributors who want to understand how the automation works.
### End-to-End Flow

```mermaid
flowchart LR
subgraph GitHub
A[Issue Created] --> B[GHA Workflow]
B --> C{Collaborator?}
end
subgraph "Kubernetes Cluster"
C -->|Yes| D[Woodpecker Pipeline]
D --> E[Vault Auth<br/>K8s SA JWT]
E --> F[Fetch API Token]
end
subgraph "claude-agent-service (K8s)"
F --> G[HTTP POST /execute]
G --> H[issue-responder agent]
H --> I[Investigate / Implement]
I --> J[Comment on Issue]
I --> K[Terraform Apply]
I --> L[Post-Mortem Pipeline]
end
subgraph "Post-Mortem Pipeline"
L --> M[sev-triage<br/>haiku, ~60s]
M --> N[Specialists<br/>3-5 agents parallel]
N --> O[sev-historian<br/>cross-ref past incidents]
O --> P[sev-report-writer<br/>write report + action items]
P --> Q[postmortem-todo-resolver<br/>implement safe fixes]
end
style B fill:#2088ff,color:#fff
style D fill:#4c9e47,color:#fff
style H fill:#6366f1,color:#fff
style Q fill:#6366f1,color:#fff
```
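
The Vault hop in the diagram (service-account JWT login, then token fetch) looks roughly like this in shell. The role name, secret path, and key are placeholders, not the values used by the real pipeline:

```sh
# 1. Log in to Vault with the pipeline pod's Kubernetes service-account JWT
#    ("woodpecker" is a placeholder role name).
SA_JWT=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
VAULT_TOKEN=$(curl -sf -X POST "$VAULT_ADDR/v1/auth/kubernetes/login" \
  -d "{\"role\": \"woodpecker\", \"jwt\": \"$SA_JWT\"}" | jq -r '.auth.client_token')

# 2. Read the claude-agent-service API token from KV v2
#    (secret path and key name are illustrative).
API_TOKEN=$(curl -sf -H "X-Vault-Token: $VAULT_TOKEN" \
  "$VAULT_ADDR/v1/secret/data/claude-agent" | jq -r '.data.data.api_token')

# 3. This token is then used for the HTTP POST /execute call shown above.
```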
### Components
| Component | Location | Purpose |
|---|---|---|
| GHA Workflow | `.github/workflows/issue-automation.yml` | Triggers on issue creation, checks collaborator, POSTs to Woodpecker |
| Woodpecker Pipeline | `.woodpecker/issue-automation.yml` | Authenticates to Vault, fetches the API token, runs the Claude agent via claude-agent-service over HTTP |
| Issue Responder | `.claude/agents/issue-responder.md` | Reads issue, classifies, investigates, fixes or escalates |
| Post-Mortem Orchestrator | `.claude/agents/post-mortem.md` | 4-stage investigation pipeline |
| SEV Triage | `.claude/agents/sev-triage.md` | Fast cluster scan + severity classification |
| SEV Historian | `.claude/agents/sev-historian.md` | Cross-references past incidents |
| SEV Report Writer | `.claude/agents/sev-report-writer.md` | Writes final postmortem + links to issue |
| TODO Resolver | `.claude/agents/postmortem-todo-resolver.md` | Implements safe follow-up fixes |
| Post-Mortem Skill | `.claude/skills/post-mortem/` | Manual `/post-mortem` command |
| Cluster Health | `.claude/skills/cluster-health/` | Health check with auto-filing for SEV1/SEV2 |
| Status Page CronJob | `stacks/status-page/main.tf` | Pushes status + incidents to GitHub Pages every 5 min |
| Issue Templates | `.github/ISSUE_TEMPLATE/` | Structured forms for outage reports + feature requests |
### Safety Guardrails
The automated agent follows strict rules:
- All changes go through Terraform — never `kubectl apply` as final state
- `terraform plan` before every apply — aborts if any resources would be destroyed
- Platform stacks are hands-off — vault, dbaas, traefik, authentik, kyverno always escalate
- No data deletion — never deletes PVCs, PVs, or user data
- Budget capped — $10 max per issue, $5 per post-mortem run
- Complex = escalate — if the agent isn't confident, it assigns to Viktor with findings
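
The plan guardrail above can be enforced mechanically. A minimal sketch of the idea (the real pipelines may implement it differently):

```sh
# Refuse to apply a plan that would destroy anything.
terraform plan -out=tfplan
destroys=$(terraform show -json tfplan \
  | jq '[.resource_changes[]?.change.actions[]? | select(. == "delete")] | length')
if [ "$destroys" -gt 0 ]; then
  echo "Plan would destroy $destroys resource(s); escalating instead of applying." >&2
  exit 1
fi
terraform apply tfplan
```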
### Labels
| Label | Purpose |
|---|---|
| `user-report` | Auto-applied to outage reports |
| `feature-request` | Auto-applied to feature requests |
| `incident` | Confirmed incident (appears on status page) |
| `sev1` / `sev2` / `sev3` | Severity classification |
| `postmortem-required` | SEV needs a postmortem |
| `postmortem-done` | Postmortem written and linked |
| `needs-human` | Agent escalated — needs Viktor's attention |
### Commit Conventions
| Pattern | Used by |
|---|---|
| `feat: <desc> (fixes #N)` | Issue responder (feature implementations) |
| `fix: <desc> (fixes #N)` | Issue responder (incident fixes) |
| `fix(post-mortem): <action> [PM-YYYY-MM-DD]` | Post-mortem TODO resolver |
| `docs: post-mortem for <date> <title> [ci skip]` | Post-mortem writer |