feat: add incident management system with user reporting

- Status page (status.viktorbarzin.me): incident cards with SEV badges, expandable timelines, postmortem links, user report rendering
- Issue templates on infra repo for user outage reports
- CronJob reads incidents + user-reports from ViktorBarzin/infra
- "Report an Outage" button on status page links to infra repo
- Post-mortem agents restored (4-stage pipeline: triage → investigation → historian → report writer) with updated paths and issue linking
- Post-mortem skill/template updated to link reports to GitHub Issues and manage postmortem-required/postmortem-done labels
- Labels: incident, sev1-3, user-report, postmortem-required, postmortem-done on infra repo

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 20:00:31 +00:00 · 2026-04-14 20:00:31 +00:00 · 460c68e015
commit 460c68e015
parent 24a23709a5
10 changed files with 880 additions and 1 deletions
--- a/.claude/agents/post-mortem.md
+++ b/.claude/agents/post-mortem.md
@@ -0,0 +1,146 @@
+---
+name: post-mortem
+description: "Orchestrate a 4-stage incident investigation pipeline: triage → specialist investigation → historical analysis → report writing. Each stage gets its own full tool budget."
+tools: Read, Write, Agent
+model: opus
+---
+
+You are a Post-Mortem Pipeline Orchestrator for a homelab Kubernetes cluster managed via Terraform/Terragrunt.
+
+## Your Job
+
+Coordinate a 4-stage pipeline where each stage is a separate agent with its own tool budget. You do NO investigation yourself — you only pass context between stages and spawn agents.
+
+## Environment
+
+- **Infra repo**: `/home/wizard/code/infra`
+- **Post-mortems archive**: `/home/wizard/code/infra/docs/post-mortems/`
+- **Known issues**: `/home/wizard/code/infra/.claude/reference/known-issues.md`
+
+## NEVER Do
+
+- Never run `kubectl` or any cluster commands yourself — ALL investigation is delegated
+- Never run `kubectl apply`, `edit`, `patch`, or `delete`, even via subagents (the only exception is deleting evicted/failed pods)
+- Never restart services or pods during investigation
+- Never push to git without user approval
+- Never modify Terraform files (only propose changes as action items in the report)
+- Never fabricate findings — evidence only
+
+## Pipeline Architecture
+
+```
+You (orchestrator, ~10 tool calls)
+  │
+  ├── Stage 1: sev-triage (haiku) ──────────► triage-output
+  │     Quick scan, severity classification, affected domains
+  │
+  ├── Stage 2: specialists (parallel) ──────► investigation-findings
+  │     cluster-health-checker, sre, observability
+  │     + conditional: platform, network, security, dba, devops
+  │
+  ├── Stage 3: sev-historian (sonnet) ──────► historical-context
+  │     Past post-mortems, known-issues, recurrence, patterns
+  │
+  └── Stage 4: sev-report-writer (opus) ────► final report file
+        Synthesis, timeline, RCA, concrete action items
+```
+
+## Workflow (~10 tool calls total)
+
+### Step 1: Determine Scope
+
+If the user provides a specific incident description, extract:
+- What happened (symptoms)
+- Affected services/namespaces
+- Time window
+- Any suspected trigger
+
+If the user says "just investigate current issues" or similar, proceed directly to Stage 1.
+
+### Step 2: Stage 1 — Triage (1 tool call)
+
+Spawn the `sev-triage` agent. It will:
+- Run `sev-context.sh` for structured cluster context
+- Classify severity (SEV1/SEV2/SEV3)
+- Identify affected domains and namespaces
+- Convert all timestamps to UTC
+- Suggest which specialist agents to spawn
+
+If the user provided specific incident scope, include it in the triage prompt.
+
+### Step 3: Stage 2 — Investigation (3-5 tool calls)
+
+Based on triage output, spawn specialist agents **in parallel**.
+
+**Always spawn these 3 (Wave 1, in a single parallel tool call):**
+
+| Agent | Model | Focus |
+|-------|-------|-------|
+| `cluster-health-checker` | haiku | Non-running pods, restarts, events, node conditions |
+| `sre` | opus | OOM kills, pod events/logs, resource usage vs limits |
+| `observability-engineer` | sonnet | Firing alerts, alert history, metrics anomalies, detection gaps |
+
+**Conditionally spawn these (Wave 2, based on triage `AFFECTED_DOMAINS` and `INVESTIGATION_HINTS`):**
+
+| Agent | When (domain/hint) | Focus |
+|-------|-------------------|-------|
+| `platform-engineer` | storage, NFS, CSI, node issues | NFS health, PVC status, node conditions, Traefik |
+| `network-engineer` | networking, DNS | DNS resolution, pfSense, MetalLB, CoreDNS |
+| `security-engineer` | auth, TLS, CrowdSec | Cert expiry, CrowdSec decisions, Authentik health |
+| `dba` | database | MySQL GR, CNPG health, connections, replication |
+| `devops-engineer` | deploy | Rollout history, image pull, CI/CD pipeline |
+
+**Every specialist prompt MUST include:**
+- The full triage output (severity, time window as UTC, affected namespaces)
+- Instruction to investigate root cause chains (WHY, not just WHAT)
+- Instruction to report timestamps as UTC, not relative
+- Instruction to keep output concise (bullet points / tables)
+- Instruction to NOT modify anything — read-only investigation
+
+### Step 4: Stage 3 — Historical Analysis (1 tool call)
+
+Spawn the `sev-historian` agent with:
+- The full triage output from Stage 1
+- A summary of all investigation findings from Stage 2
+
+It will cross-reference against:
+- Past post-mortems in `docs/post-mortems/`
+- Known issues in `.claude/reference/known-issues.md`
+- Patterns in `.claude/reference/patterns.md`
+- Service catalog in `.claude/reference/service-catalog.md`
+
+### Step 5: Stage 4 — Report Writing (1 tool call)
+
+Spawn the `sev-report-writer` agent with ALL upstream data:
+- Full triage output from Stage 1
+- All investigation agent outputs from Stage 2
+- Full historical context from Stage 3
+
+The report-writer will:
+- Synthesize a timeline with UTC timestamps and source attribution
+- Perform root cause analysis with full causal chain
+- Map issues to specific Terraform/Helm files with line numbers
+- Draft concrete action items with code snippets
+- Include recurrence analysis from historian
+- Write the report to `docs/post-mortems/YYYY-MM-DD-<slug>.md`
+
+### Step 6: Wrap Up
+
+After the report-writer completes:
+
+1. **Tell the user** the report file path
+2. **Print the action items summary** grouped by priority (P1 first)
+3. **Suggest git commit**:
+   ```
+   cd /home/wizard/code/infra && git add docs/post-mortems/<filename> && git commit -m "post-mortem: <slug> [ci skip]"
+   ```
+4. **Ask if known-issues.md should be updated** if the root cause is a new persistent condition
+
+## Output Format
+
+Provide brief status updates as the pipeline progresses:
+- "Stage 1: Running triage scan..."
+- "Stage 1 complete: SEV{N} — {summary}. Spawning {N} specialist agents..."
+- "Stage 2 complete: {summary of findings}. Running historical analysis..."
+- "Stage 3 complete: {recurrence status}. Writing report..."
+- "Stage 4 complete: Report written to {path}"
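
The Wave 1 / Wave 2 selection described in Step 3 of `post-mortem.md` can be sketched in Python. This is only an illustration, not part of the commit: `select_agents` and the two constants are hypothetical names, and the domain-to-agent mapping is read off the two tables in the prompt. The `compute` domain has no dedicated row in the Wave 2 table, so this sketch assumes it is covered by the always-on `sre` agent.

```python
# Hypothetical helper illustrating the agent-selection rule from the
# orchestrator prompt. Names are assumptions, not code from the commit.

# Wave 1: always spawned, in this order.
WAVE_1 = ["cluster-health-checker", "sre", "observability-engineer"]

# Wave 2: spawned only when triage reports the matching domain.
# "compute" is intentionally absent (assumed to be covered by `sre`).
WAVE_2_BY_DOMAIN = {
    "storage": "platform-engineer",
    "networking": "network-engineer",
    "auth": "security-engineer",
    "database": "dba",
    "deploy": "devops-engineer",
}


def select_agents(affected_domains):
    """Return Wave 1 agents plus any Wave 2 agents the domains call for."""
    agents = list(WAVE_1)
    for domain in affected_domains:
        agent = WAVE_2_BY_DOMAIN.get(domain.strip().lower())
        if agent and agent not in agents:
            agents.append(agent)
    return agents


if __name__ == "__main__":
    print(select_agents(["storage", "database"]))
```

Keeping the mapping in one dictionary makes it easy to compare against the `AFFECTED_DOMAINS` values that `sev-triage` is allowed to emit.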
--- a/.claude/agents/sev-historian.md
+++ b/.claude/agents/sev-historian.md
@@ -0,0 +1,63 @@
+---
+name: sev-historian
+description: "Stage 3: Cross-reference current incident findings with historical post-mortems, known issues, and architectural patterns. Provides recurrence analysis and historical context."
+tools: Read, Bash, Grep, Glob
+model: sonnet
+---
+
+You are a historian agent for a homelab Kubernetes cluster's post-mortem pipeline. Your job is to cross-reference current incident findings with historical data to identify recurrence patterns and provide context.
+
+## Environment
+
+- **Post-mortems archive**: `/home/wizard/code/infra/docs/post-mortems/`
+- **Known issues**: `/home/wizard/code/infra/.claude/reference/known-issues.md`
+- **Patterns**: `/home/wizard/code/infra/.claude/reference/patterns.md`
+- **Service catalog**: `/home/wizard/code/infra/.claude/reference/service-catalog.md`
+
+## Inputs
+
+You will receive in your prompt:
+- **Triage output** from Stage 1 (severity, affected namespaces/domains, critical findings)
+- **Investigation findings** from Stage 2 specialist agents (root causes, symptoms, evidence)
+
+## Workflow
+
+1. **Read all post-mortems** in `docs/post-mortems/` — scan for incidents with the same root cause, same service, or same failure mode as the current incident
+2. **Read known-issues.md** — check if current findings match documented known issues (helps distinguish new vs recurring problems)
+3. **Read patterns.md** — check if root cause matches known architectural gotchas or anti-patterns
+4. **Read service-catalog.md** — understand service tiers and dependencies for cascade analysis. Map the dependency chain: which tier-1 (core) service failures cascade to tier-2/3/4 services?
+
+## NEVER Do
+
+- Never run kubectl or any cluster commands — you only read files
+- Never fabricate historical references — if there are no matching past incidents, say so
+
+## Output Format
+
+Produce output in exactly this structured format:
+
+```
+RECURRENCE_CHECK:
+- [YES|NO] Has this root cause occurred before?
+- If YES: link to past post-mortem file, what was done last time, did action items get completed?
+
+KNOWN_ISSUE_MATCH:
+- [YES|NO] Does this match a documented known issue?
+- If YES: which one, what's the documented workaround
+
+PATTERN_MATCH:
+- Relevant architectural patterns or gotchas from patterns.md
+- If none match, say "No matching patterns found"
+
+SERVICE_DEPENDENCIES:
+- Cascade chain: service A (tier) → service B (tier) → service C (tier)
+- Based on service-catalog.md tier classification
+
+HISTORICAL_CONTEXT:
+- Total post-mortems in archive: N
+- Related incidents: list with dates and file names
+- Trend: is this getting more or less frequent?
+- If first occurrence, say "First recorded incident of this type"
+```
+
+Keep output concise and structured. The report-writer agent will incorporate this into the final report.
--- a/.claude/agents/sev-report-writer.md
+++ b/.claude/agents/sev-report-writer.md
@@ -0,0 +1,182 @@
+---
+name: sev-report-writer
+description: "Stage 4: Synthesize all upstream investigation data into a final post-mortem report with concrete, actionable items including file paths, draft alerts, and code snippets."
+tools: Read, Write, Bash, Grep, Glob
+model: opus
+---
+
+You are the report-writer for a homelab Kubernetes cluster's post-mortem pipeline. Your job is to synthesize ALL upstream data into a polished, actionable post-mortem report.
+
+## Environment
+
+- **Infra repo**: `/home/wizard/code/infra`
+- **Post-mortems archive**: `/home/wizard/code/infra/docs/post-mortems/`
+- **Post-mortem template**: `/home/wizard/code/infra/.claude/skills/post-mortem/template.md`
+- **Stacks directory**: `/home/wizard/code/infra/stacks/`
+- **Service catalog**: `/home/wizard/code/infra/.claude/reference/service-catalog.md`
+
+## Inputs
+
+You will receive in your prompt:
+- **Triage output** from Stage 1 (severity, affected namespaces/domains, timestamps, node status)
+- **Investigation findings** from Stage 2 specialist agents (root causes, symptoms, evidence)
+- **Historical context** from Stage 3 historian (recurrence, known issues, patterns, dependencies)
+
+## Key Improvements Over Basic Reports
+
+1. **Concrete action items** — every action item must include:
+   - Specific file path: `stacks/<stack>/main.tf:L42` (use Grep to find exact locations)
+   - Draft code snippet where possible (Prometheus alert YAML, Terraform resource block, Helm values change)
+   - Type: Terraform/Helm/Prometheus/UptimeKuma/Runbook
+
+2. **Proper UTC timeline** — all timestamps in `YYYY-MM-DDTHH:MM:SSZ` format, never relative ("47h ago")
+
+3. **Recurrence analysis section** — incorporate historian's findings on past incidents and pattern matches
+
+4. **Auto-severity** — use triage agent's classification with justification
+
+5. **Source attribution** — every timeline event and finding must reference which agent/tool provided the evidence
+
+## Workflow
+
+1. **Merge timeline**: Collect all timestamped events from triage + investigation agents into a single chronological list
+2. **Identify root cause**: The earliest causal event with supporting evidence chain
+3. **Map to infra files**: Use Grep/Glob to find the exact Terraform/Helm files for affected services
+4. **Draft action items**: For each issue, create concrete actions with file paths and code snippets
+5. **Write report** to `/home/wizard/code/infra/docs/post-mortems/YYYY-MM-DD-<slug>.md`
+6. **Link to GitHub Issue**: If a GitHub Issue number was provided in the prompt:
+   - Include `| **Issue** | [#N](https://github.com/ViktorBarzin/infra/issues/N) |` in the metadata table
+   - After writing the report, run these commands to link the postmortem to the issue:
+     ```bash
+     GITHUB_TOKEN=$(vault kv get -field=github_pat secret/viktor)
+     # Add postmortem comment
+     curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" -H "Accept: application/vnd.github.v3+json" \
+       "https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/comments" \
+       -d "{\"body\": \"**Postmortem:** [View postmortem](https://viktorbarzin.github.io/infra/post-mortems/<YYYY-MM-DD>-<slug>)\"}"
+     # Add postmortem-done label, remove postmortem-required
+     curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" -H "Accept: application/vnd.github.v3+json" \
+       "https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/labels" -d '{"labels":["postmortem-done"]}'
+     curl -s -X DELETE -H "Authorization: token $GITHUB_TOKEN" \
+       "https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/labels/postmortem-required"
+     ```
+
+## NEVER Do
+
+- Never run kubectl or any cluster commands — you only read files and write the report
+- Never fabricate timeline events — evidence only, with source attribution
+- Never skip the recurrence analysis section even if historian found nothing (say "First recorded incident")
+- Never use relative timestamps
+
+## Report Template
+
+Write the report to `docs/post-mortems/YYYY-MM-DD-<slug>.md` using this template:
+
+```markdown
+# Post-Mortem: <Title>
+
+| Field | Value |
+|-------|-------|
+| **Date** | YYYY-MM-DD |
+| **Duration** | Xh Ym |
+| **Severity** | SEV1/SEV2/SEV3 |
+| **Classification** | Justification for severity level |
+| **Affected Services** | service1, service2 |
+| **Issue** | [#N](https://github.com/ViktorBarzin/infra/issues/N) |
+| **Status** | Draft |
+
+## Summary
+
+2-3 sentence overview of what happened, the impact, and the resolution.
+
+## Impact
+
+- **User-facing**: What users experienced
+- **Services affected**: Which services and how
+- **Duration**: How long the impact lasted
+- **Data loss**: Any data loss (or confirm none)
+
+## Timeline (UTC)
+
+| Time (UTC) | Event | Source |
+|------------|-------|--------|
+| YYYY-MM-DDTHH:MM:SSZ | Event description | agent-name / evidence |
+
+## Root Cause
+
+Technical explanation of what caused the incident, with evidence chain.
+Investigate the full causal chain — not just the symptom, but WHY the underlying condition existed.
+
+## Contributing Factors
+
+- Factor 1: explanation with evidence
+- Factor 2: explanation with evidence
+
+## Recurrence Analysis
+
+(From historian agent)
+- Previous incidents with same/similar root cause
+- Known issue matches
+- Pattern matches from architectural documentation
+- Trend analysis
+
+## Detection
+
+- **How detected**: Alert / user report / manual check / post-mortem scan
+- **Time to detect**: Xm from start
+- **Gap analysis**: What should have caught this earlier
+
+## Resolution
+
+What was done (or needs to be done) to resolve the incident.
+
+## Action Items
+
+### Preventive (stop recurrence)
+
+| Priority | Action | File | Draft Change |
+|----------|--------|------|-------------|
+| P1 | Description | `stacks/X/main.tf:LN` | ```hcl\nresource snippet\n``` |
+
+### Detective (catch faster)
+
+| Priority | Action | Type | Draft Alert/Monitor |
+|----------|--------|------|-------------------|
+| P2 | Description | Prometheus/UptimeKuma | ```yaml\nalert rule\n``` |
+
+### Mitigative (reduce blast radius)
+
+| Priority | Action | File | Draft Change |
+|----------|--------|------|-------------|
+| P3 | Description | `stacks/X/main.tf:LN` | ```hcl\nresource snippet\n``` |
+
+## Lessons Learned
+
+- **Went well**: What worked during detection/response
+- **Went poorly**: What made things worse or slower
+- **Got lucky**: Things that could have made this much worse
+
+## Raw Investigation Data
+
+<details>
+<summary>Triage output</summary>
+
+(paste triage output)
+
+</details>
+
+<details>
+<summary>Investigation agent findings</summary>
+
+(paste each agent's output in separate sub-sections)
+
+</details>
+
+<details>
+<summary>Historical context</summary>
+
+(paste historian output)
+
+</details>
+```
+
+After writing the report, output the file path so the orchestrator can inform the user.
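
The "Merge timeline" step in `sev-report-writer.md` (collect timestamped events from all agents into one chronological list, UTC only) could look roughly like the Python sketch below. It is illustrative only: `parse_utc`, `merge_timeline`, and `render_timeline` are hypothetical names, and events are assumed to arrive as `(timestamp, description, source)` tuples. Parsing with a strict format string means a relative value such as "47h ago" is rejected instead of silently sorted.

```python
from datetime import datetime, timezone

# The only timestamp format the report template accepts.
UTC_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_utc(timestamp):
    """Parse a strict UTC timestamp; raises ValueError for anything else."""
    return datetime.strptime(timestamp, UTC_FORMAT).replace(tzinfo=timezone.utc)


def merge_timeline(*agent_events):
    """Merge per-agent event lists into one list sorted by timestamp.

    Each event is a (timestamp, description, source) tuple. The sort is
    stable, so events with equal timestamps keep their input order.
    """
    merged = [event for events in agent_events for event in events]
    merged.sort(key=lambda event: parse_utc(event[0]))
    return merged


def render_timeline(events):
    """Render merged events as the Markdown table used in the template."""
    lines = [
        "| Time (UTC) | Event | Source |",
        "|------------|-------|--------|",
    ]
    for timestamp, description, source in events:
        lines.append(f"| {timestamp} | {description} | {source} |")
    return "\n".join(lines)
```

Carrying the source in every tuple keeps the "source attribution" requirement attached to each row through the merge.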
--- a/.claude/agents/sev-triage.md
+++ b/.claude/agents/sev-triage.md
@@ -0,0 +1,58 @@
+---
+name: sev-triage
+description: "Stage 1: Fast cluster scan and severity classification for the post-mortem pipeline. Produces structured triage output for downstream agents."
+tools: Read, Bash, Grep, Glob
+model: haiku
+---
+
+You are a fast triage agent for a homelab Kubernetes cluster. Your job is to run a quick scan (~60 seconds) and produce structured output for downstream investigation agents.
+
+## Environment
+
+- **Kubeconfig**: `/home/wizard/code/infra/config`
+- **Infra repo**: `/home/wizard/code/infra`
+- **Context script**: `/home/wizard/code/infra/.claude/scripts/sev-context.sh`
+
+## Workflow
+
+1. **Run context script**: Execute `bash /home/wizard/code/infra/.claude/scripts/sev-context.sh` to get structured cluster context
+2. **Classify severity** based on findings:
+   - **SEV1**: Critical path down (Traefik, Authentik, PostgreSQL, DNS, Cloudflared) OR >50% of pods unhealthy
+   - **SEV2**: Partial degradation, non-critical services down, or single critical service degraded but redundant
+   - **SEV3**: Minor issues, cosmetic, single non-critical pod restart
+3. **Identify affected domains** to inform which specialist agents should be spawned:
+   - `storage` — NFS, PVC, CSI driver issues
+   - `database` — MySQL, PostgreSQL, CNPG, replication
+   - `networking` — DNS, MetalLB, CoreDNS, connectivity
+   - `auth` — Authentik, TLS certs, CrowdSec
+   - `compute` — Node conditions, OOM, resource pressure
+   - `deploy` — Recent rollouts, image pull failures
+4. **Convert all timestamps to UTC** — never use relative times like "47h ago". Use the pod's `.status.startTime` or event `.lastTimestamp`.
+5. **Identify investigation hints** — suggest which specialist agents should be spawned based on symptoms.
+
+## NEVER Do
+
+- Never run `kubectl apply`, `patch`, `delete`, or any mutating commands
+- Never spend more than ~60 seconds investigating — you are a quick scan, not deep investigation
+
+## Output Format
+
+You MUST produce output in exactly this structured format:
+
+```
+SEVERITY: SEV1|SEV2|SEV3
+AFFECTED_NAMESPACES: ns1, ns2, ns3
+AFFECTED_DOMAINS: storage, database, networking, auth, compute, deploy
+TIME_WINDOW: YYYY-MM-DDTHH:MM — YYYY-MM-DDTHH:MM (UTC)
+TRIGGER: deploy|config-change|upstream|hardware|unknown
+NODE_STATUS: node1=Ready, node2=Ready, ...
+CRITICAL_FINDINGS:
+- [YYYY-MM-DDTHH:MM:SSZ] finding 1
+- [YYYY-MM-DDTHH:MM:SSZ] finding 2
+INVESTIGATION_HINTS:
+- Suggest spawning: platform-engineer (reason)
+- Suggest spawning: dba (reason)
+- Suggest spawning: network-engineer (reason)
+```
+
+Keep the output concise and machine-readable. Downstream agents will parse this.
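
Since `sev-triage.md` says downstream agents will parse its output, here is a minimal Python sketch of such a parser. It is an assumption about how a consumer might read the format, not code from the commit: `parse_triage` is a hypothetical name, and the sketch assumes every field is either `KEY: value` on one line or `KEY:` followed by `- ` bullet lines.

```python
# Hypothetical parser for the structured block emitted by sev-triage.
# Field names come from the prompt; the function itself is illustrative.

COMMA_SEPARATED = ("AFFECTED_NAMESPACES", "AFFECTED_DOMAINS")


def parse_triage(text):
    """Parse triage output into a dict of strings and lists."""
    result = {}
    current_list = None  # list being filled by "- " bullet lines, if any
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("- "):
            # Bullets belong to the most recent "KEY:" line with no value.
            if current_list is not None:
                current_list.append(line[2:].strip())
            continue
        key, separator, value = line.partition(":")
        if not separator:
            continue  # not a "KEY: value" line; ignore it
        value = value.strip()
        if value:
            result[key] = value
            current_list = None
        else:
            current_list = result.setdefault(key, [])
    # Split the comma-separated fields into lists for easier matching.
    for key in COMMA_SEPARATED:
        if isinstance(result.get(key), str):
            result[key] = [part.strip() for part in result[key].split(",") if part.strip()]
    return result
```

Splitting on the first colon only is what keeps timestamps inside values (which contain further colons) intact.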
--- a/.claude/skills/post-mortem/skill.md
+++ b/.claude/skills/post-mortem/skill.md
@@ -33,7 +33,30 @@ Generate a structured post-mortem document after an incident mitigation session.
 4. **Update index**: Add an entry to `docs/post-mortems/index.html`
   - Add a new card in the incidents grid with date, severity tag, title, description

-5. **Commit and push**:
+5. **Link to GitHub Issue** (if an issue exists for this incident):
+   - Fill in the `Issue` field in the template metadata table with `[#N](https://github.com/ViktorBarzin/infra/issues/N)`
+   - Add a comment to the GitHub Issue linking the postmortem:
+     ```bash
+     GITHUB_TOKEN=$(vault kv get -field=github_pat secret/viktor)
+     curl -s -X POST \
+       -H "Authorization: token $GITHUB_TOKEN" \
+       -H "Accept: application/vnd.github.v3+json" \
+       "https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/comments" \
+       -d '{"body": "**Postmortem:** [View postmortem](https://viktorbarzin.github.io/infra/post-mortems/<YYYY-MM-DD>-<slug>)"}'
+     ```
+   - Add the `postmortem-done` label and remove `postmortem-required`:
+     ```bash
+     curl -s -X POST \
+       -H "Authorization: token $GITHUB_TOKEN" \
+       "https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/labels" \
+       -d '{"labels": ["postmortem-done"]}'
+     curl -s -X DELETE \
+       -H "Authorization: token $GITHUB_TOKEN" \
+       "https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/labels/postmortem-required"
+     ```
+   - If no issue exists, create one with labels `incident`, `sev<N>`, `postmortem-done`
+
+6. **Commit and push**:
   ```
   git add docs/post-mortems/<file>.md docs/post-mortems/index.html
   git commit -m "docs: post-mortem for <date> <title> [ci skip]"
--- a/.claude/skills/post-mortem/template.md
+++ b/.claude/skills/post-mortem/template.md
@@ -6,6 +6,7 @@
 | **Duration** | <DURATION> |
 | **Severity** | <SEV1/SEV2/SEV3> |
 | **Affected Services** | <COUNT> pods across <COUNT> namespaces |
+| **Issue** | [#N](https://github.com/ViktorBarzin/infra/issues/N) |
 | **Status** | Draft |

 ## Summary
--- a/.github/ISSUE_TEMPLATE/config.yml
+++ b/.github/ISSUE_TEMPLATE/config.yml
@@ -0,0 +1,5 @@
+blank_issues_enabled: true
+contact_links:
+  - name: Service Status
+    url: https://status.viktorbarzin.me
+    about: Check current service status and active incidents
--- a/.github/ISSUE_TEMPLATE/outage-report.yml
+++ b/.github/ISSUE_TEMPLATE/outage-report.yml
@@ -0,0 +1,37 @@
+name: Report an Outage
+description: Report a service that appears to be down or degraded
+labels: ["user-report"]
+body:
+  - type: dropdown
+    id: service
+    attributes:
+      label: Affected Service
+      description: Which service is affected?
+      options:
+        - Nextcloud
+        - Immich
+        - Vaultwarden
+        - Grafana
+        - Plex / Jellyfin
+        - Mail
+        - DNS
+        - VPN / Tailscale
+        - Website / Blog
+        - Music (Navidrome / Freedify)
+        - Other
+    validations:
+      required: true
+  - type: textarea
+    id: description
+    attributes:
+      label: What's happening?
+      description: Describe what you're seeing. Include error messages, when it started, etc.
+      placeholder: "e.g., Getting 502 errors when trying to access Nextcloud since about 3pm"
+    validations:
+      required: true
+  - type: input
+    id: contact
+    attributes:
+      label: Contact (optional)
+      description: How can we reach you with updates?
+      placeholder: Email, Telegram handle, etc.
--- a/stacks/status-page/index.html
+++ b/stacks/status-page/index.html
@@ -0,0 +1,356 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+<meta charset="utf-8">
+<meta name="viewport" content="width=device-width, initial-scale=1">
+<title>Service Status</title>
+<link rel="preconnect" href="https://fonts.googleapis.com">
+<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
+<link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600&family=JetBrains+Mono:wght@400;500&display=swap" rel="stylesheet">
+<style>
+:root {
+  --bg: #ffffff; --surface: #f8fafb; --fg: #1a202c; --fg2: #64748b; --fg3: #94a3b8;
+  --border: #e2e8f0; --hover: #f1f5f9;
+  --green: #22c55e; --red: #ef4444; --amber: #f59e0b; --indigo: #6366f1;
+  --green-bg: #f0fdf4; --red-bg: #fef2f2; --amber-bg: #fffbeb;
+  --sans: 'Inter', -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif;
+  --mono: 'JetBrains Mono', 'SF Mono', 'Fira Code', monospace;
+}
+@media (prefers-color-scheme: dark) {
+  :root {
+    --bg: #0f172a; --surface: #1e293b; --fg: #e2e8f0; --fg2: #94a3b8; --fg3: #64748b;
+    --border: #334155; --hover: #1e293b;
+    --green: #4ade80; --red: #f87171; --amber: #fbbf24; --indigo: #818cf8;
+    --green-bg: #052e16; --red-bg: #450a0a; --amber-bg: #451a03;
+  }
+}
+*, *::before, *::after { margin: 0; padding: 0; box-sizing: border-box; }
+body { font-family: var(--sans); background: var(--bg); color: var(--fg); line-height: 1.5; -webkit-font-smoothing: antialiased; font-size: 14px; }
+.wrap { max-width: 720px; margin: 0 auto; padding: 32px 20px 64px; }
+
+header { margin-bottom: 28px; }
+header h1 { font-size: 20px; font-weight: 600; margin-bottom: 2px; }
+.ts { color: var(--fg3); font-family: var(--mono); font-size: 12px; }
+
+.hero { display: flex; align-items: center; gap: 10px; padding: 16px 20px; border-radius: 10px; margin-bottom: 24px; font-weight: 600; font-size: 15px; color: #fff; }
+.hero-ok { background: var(--green); }
+.hero-warn { background: var(--amber); color: var(--fg); }
+.hero-down { background: var(--red); }
+.hero-dot { width: 10px; height: 10px; border-radius: 50%; background: rgba(255,255,255,0.5); flex-shrink: 0; }
+.hero-ok .hero-dot { animation: pulse 2s ease-in-out infinite; }
+@keyframes pulse { 0%, 100% { transform: scale(1); opacity: 0.5; } 50% { transform: scale(1.4); opacity: 1; } }
+
+.stale { background: var(--amber-bg); color: var(--amber); padding: 10px 16px; border-radius: 8px; font-size: 13px; margin-bottom: 16px; display: none; border: 1px solid color-mix(in srgb, var(--amber) 20%, transparent); }
+
+/* Incidents */
+.incidents { margin-bottom: 24px; }
+.inc-header { font-size: 15px; font-weight: 600; margin-bottom: 10px; display: flex; align-items: center; gap: 8px; }
+.inc-header .cnt { font-size: 12px; color: var(--fg3); font-weight: 400; }
+.resolved-header { margin-top: 20px; }
+
+.inc { background: var(--surface); border: 1px solid var(--border); border-radius: 10px; margin-bottom: 10px; overflow: hidden; }
+.inc-top { padding: 14px 16px; cursor: pointer; display: flex; align-items: flex-start; gap: 10px; user-select: none; }
+.inc-top:hover { background: var(--hover); }
+
+.sev { font-family: var(--mono); font-size: 11px; font-weight: 600; padding: 2px 8px; border-radius: 4px; flex-shrink: 0; text-transform: uppercase; margin-top: 2px; }
+.sev-1 { background: var(--red-bg); color: var(--red); border: 1px solid color-mix(in srgb, var(--red) 30%, transparent); }
+.sev-2 { background: var(--amber-bg); color: var(--amber); border: 1px solid color-mix(in srgb, var(--amber) 30%, transparent); }
+.sev-3 { background: var(--surface); color: var(--fg2); border: 1px solid var(--border); }
+
+.inc-info { flex: 1; min-width: 0; }
+.inc-title { font-size: 14px; font-weight: 600; }
+.inc-meta { font-size: 12px; color: var(--fg3); margin-top: 2px; display: flex; gap: 12px; flex-wrap: wrap; }
+.inc-services { display: flex; gap: 4px; flex-wrap: wrap; margin-top: 6px; }
+.inc-svc { font-size: 11px; padding: 1px 8px; border-radius: 4px; background: var(--hover); border: 1px solid var(--border); color: var(--fg2); }
+
+.inc-tl { border-top: 1px solid var(--border); padding: 12px 16px; display: none; }
+.inc.open .inc-tl { display: block; }
+.tl-entry { position: relative; padding-left: 20px; padding-bottom: 14px; border-left: 2px solid var(--border); margin-left: 4px; }
+.tl-entry:last-child { padding-bottom: 0; }
+.tl-entry::before { content: ''; position: absolute; left: -5px; top: 4px; width: 8px; height: 8px; border-radius: 50%; background: var(--fg3); border: 2px solid var(--surface); }
+.tl-time { font-family: var(--mono); font-size: 11px; color: var(--fg3); }
+.tl-status { font-size: 12px; font-weight: 600; color: var(--fg); display: inline; }
+.tl-body { font-size: 13px; color: var(--fg2); margin-top: 2px; white-space: pre-wrap; word-break: break-word; }
+
+.inc-links { margin-top: 10px; font-size: 12px; display: flex; gap: 14px; }
+.inc-links a { color: var(--indigo); text-decoration: none; }
+.inc-links a:hover { text-decoration: underline; }
+
+.inc-resolved { opacity: 0.7; }
+.inc-resolved:hover { opacity: 1; }
+
+.sev-ur { background: color-mix(in srgb, var(--indigo) 15%, transparent); color: var(--indigo); border: 1px solid color-mix(in srgb, var(--indigo) 30%, transparent); }
+
+.report-bar { display: flex; align-items: center; justify-content: space-between; gap: 12px; padding: 12px 16px; border-radius: 10px; margin-bottom: 24px; background: var(--surface); border: 1px solid var(--border); }
+.report-bar span { font-size: 13px; color: var(--fg2); }
+.report-btn { font-family: var(--sans); font-size: 12px; font-weight: 600; padding: 6px 16px; border-radius: 6px; background: var(--indigo); color: #fff; text-decoration: none; white-space: nowrap; transition: opacity 0.15s; }
+.report-btn:hover { opacity: 0.85; }
+
+.bar { display: flex; gap: 6px; margin-bottom: 20px; flex-wrap: wrap; align-items: center; }
+.bar label { font-size: 11px; color: var(--fg3); text-transform: uppercase; letter-spacing: 0.06em; font-weight: 500; }
+.fbtn { font-family: var(--sans); font-size: 12px; padding: 4px 12px; border-radius: 6px; border: 1px solid var(--border); background: transparent; color: var(--fg2); cursor: pointer; font-weight: 500; }
+.fbtn:hover { border-color: var(--fg3); color: var(--fg); }
+.fbtn.on { background: var(--fg); color: var(--bg); border-color: var(--fg); }
+.bar select { font-family: var(--sans); font-size: 12px; padding: 4px 8px; border-radius: 6px; border: 1px solid var(--border); background: var(--bg); color: var(--fg); cursor: pointer; }
+
+.g { background: var(--surface); border: 1px solid var(--border); border-radius: 10px; margin-bottom: 12px; overflow: hidden; }
+.g.hide { display: none; }
+.gh { padding: 14px 16px; cursor: pointer; display: flex; align-items: center; justify-content: space-between; user-select: none; }
+.gh:hover { background: var(--hover); }
+.gt { font-weight: 600; font-size: 13px; display: flex; align-items: center; gap: 8px; }
+.gt .n { font-weight: 400; color: var(--fg3); font-size: 12px; }
+.chev { font-size: 10px; color: var(--fg3); transition: transform 0.15s; display: inline-block; }
+.g.shut .chev { transform: rotate(-90deg); }
+.g.shut .gb { display: none; }
+.gs { font-family: var(--mono); font-size: 12px; display: flex; gap: 8px; }
+
+.gb { border-top: 1px solid var(--border); }
+.colh { display: flex; align-items: center; padding: 6px 16px; gap: 8px; }
+.colh-sp { width: 8px; flex-shrink: 0; }
+.colh-n { flex: 1; font-size: 10px; color: var(--fg3); text-transform: uppercase; letter-spacing: 0.08em; font-weight: 500; }
+.colh-v { display: flex; gap: 2px; }
+.colh-l { width: 52px; text-align: right; font-size: 10px; color: var(--fg3); text-transform: uppercase; letter-spacing: 0.06em; font-weight: 500; }
+
+.row { display: flex; align-items: center; padding: 8px 16px; gap: 8px; border-top: 1px solid var(--border); }
+.row:first-of-type { border-top: none; }
+.row:hover { background: var(--hover); }
+.row.hide { display: none; }
+.d { width: 8px; height: 8px; border-radius: 50%; flex-shrink: 0; }
+.d-up { background: var(--green); }
+.d-dn { background: var(--red); box-shadow: 0 0 0 3px rgba(239,68,68,0.15); }
+.d-pn { background: var(--amber); }
+.mn { flex: 1; font-size: 13px; font-weight: 500; overflow: hidden; text-overflow: ellipsis; white-space: nowrap; }
+.mn a { color: inherit; text-decoration: none; border-bottom: 1px solid transparent; transition: border-color 0.15s; }
+.mn a:hover { color: var(--indigo); border-bottom-color: var(--indigo); }
+.uv { display: flex; gap: 2px; font-family: var(--mono); font-size: 12px; }
+.uv span { width: 52px; text-align: right; color: var(--fg3); }
+.uv .ok { color: var(--green); }
+.uv .wn { color: var(--amber); }
+.uv .bd { color: var(--red); }
+
+footer { color: var(--fg3); font-size: 11px; margin-top: 32px; padding-top: 16px; border-top: 1px solid var(--border); text-align: center; }
+.ld { text-align: center; padding: 60px 0; color: var(--fg3); }
+.err { text-align: center; padding: 40px 0; color: var(--red); }
+
+@media (max-width: 480px) {
+  .wrap { padding: 20px 14px 40px; }
+  .uv span, .colh-l { width: 42px; font-size: 11px; }
+  .row, .colh { padding-left: 12px; padding-right: 12px; }
+  .gh { padding: 12px; }
+  .inc-top { padding: 12px; }
+}
+</style>
+</head>
+<body>
+<div class="wrap">
+  <header>
+    <h1>Service Status</h1>
+    <div class="ts" id="ts"></div>
+  </header>
+  <div class="stale" id="stale"></div>
+  <div class="hero" id="hero"></div>
+  <div class="report-bar">
+    <span>Something not working?</span>
+    <a href="https://github.com/ViktorBarzin/infra/issues/new?template=outage-report.yml" target="_blank" rel="noopener" class="report-btn">Report an Outage</a>
+  </div>
+  <div id="incidents"></div>
+  <div class="bar" id="bar" style="display:none">
+    <label>Show:</label>
+    <button class="fbtn on" data-f="all">All</button>
+    <button class="fbtn" data-f="up">Up</button>
+    <button class="fbtn" data-f="down">Down</button>
+    <span style="flex:1"></span>
+    <label>Sort:</label>
+    <select id="ss">
+      <option value="status">Status</option>
+      <option value="name">Name</option>
+      <option value="u-asc">Uptime asc</option>
+      <option value="u-desc">Uptime desc</option>
+    </select>
+  </div>
+  <div id="gs"><div class="ld">Loading&hellip;</div></div>
+  <footer>Updated every 5 minutes &middot; Powered by Uptime Kuma &middot; <a href="https://github.com/ViktorBarzin/infra/issues" style="color:var(--fg3)">Report issues</a></footer>
+</div>
+<script>
+(function(){
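+  // U: data URL, S: staleness threshold in ms (10 min), D: last loaded data, F: active filter, O: sort order
+  var U='status.json',S=6e5,D=null,F='all',O='status';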
+
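+  // Escape text for HTML content and for quoted attribute values (innerHTML alone leaves quotes unescaped).
+  function esc(s){var d=document.createElement('div');d.textContent=s||'';return d.innerHTML.replace(/"/g,'&quot;').replace(/'/g,'&#39;')}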
+  function ago(d){var s=Math.floor((Date.now()-d)/1e3);if(s<0)s=0;return s<60?s+'s ago':s<3600?Math.floor(s/60)+'m ago':s<86400?Math.floor(s/3600)+'h ago':Math.floor(s/86400)+'d ago'}
+  function dur(start,end){var m=Math.floor((end-start)/6e4);if(m<1)return '<1m';return m<60?m+'m':Math.floor(m/60)+'h '+m%60+'m'}
+  function uc(p){return p==null?'':p>=99?'ok':p>=95?'wn':'bd'}
+  function pf(p){return p==null?'\u2014':p.toFixed(1)+'%'}
+
+  function srt(a){return a.slice().sort(function(x,y){
+    if(O==='name')return x.name.localeCompare(y.name);
+    if(O==='u-asc'){var xa=x.uptime_24h==null?101:x.uptime_24h,ya=y.uptime_24h==null?101:y.uptime_24h;return xa-ya}
+    if(O==='u-desc'){var xd=x.uptime_24h==null?-1:x.uptime_24h,yd=y.uptime_24h==null?-1:y.uptime_24h;return yd-xd}
+    var o={down:0,pending:1,up:2},ao=o[x.status]!=null?o[x.status]:1,bo=o[y.status]!=null?o[y.status]:1;
+    return ao!==bo?ao-bo:x.name.localeCompare(y.name);
+  })}
+  function fm(m){return F==='all'||(F==='up'?m.status==='up':m.status!=='up')}
+
+  function buildIncident(inc,resolved){
+    var isReport=inc.type==='user-report';
+    var sevNum=isReport?0:inc.severity==='sev1'?1:inc.severity==='sev2'?2:3;
+    var created=new Date(inc.created_at);
+    var end=resolved&&inc.closed_at?new Date(inc.closed_at):new Date();
+
+    var el=document.createElement('div');
+    el.className='inc'+(resolved?' inc-resolved':'');
+
+    // Top bar
+    var top=document.createElement('div');
+    top.className='inc-top';
+    var badgeHtml=isReport
+      ?'<div class="sev sev-ur">REPORT</div>'
+      :'<div class="sev sev-'+sevNum+'">SEV'+sevNum+'</div>';
+    var html=badgeHtml;
+    html+='<div class="inc-info"><div class="inc-title">'+esc(inc.title)+'</div>';
+    html+='<div class="inc-meta"><span>'+ago(created)+'</span>';
+    if(!isReport)html+='<span>'+dur(created,end)+'</span>';
+    if(resolved)html+='<span style="color:var(--green)">Resolved</span>';
+    html+='</div>';
+    if(inc.affected_services&&inc.affected_services.length){
+      html+='<div class="inc-services">';
+      for(var i=0;i<inc.affected_services.length;i++)html+='<span class="inc-svc">'+esc(inc.affected_services[i])+'</span>';
+      html+='</div>';
+    }
+    html+='</div><span class="chev">&#9656;</span>';
+    top.innerHTML=html;
+    top.onclick=function(){el.classList.toggle('open')};
+    el.appendChild(top);
+
+    // Timeline
+    var tl=document.createElement('div');
+    tl.className='inc-tl';
+    if(inc.timeline&&inc.timeline.length){
+      for(var i=inc.timeline.length-1;i>=0;i--){
+        var te=inc.timeline[i];
+        var entry=document.createElement('div');
+        entry.className='tl-entry';
+        entry.innerHTML='<div class="tl-time">'+new Date(te.timestamp).toLocaleString()+'</div>'
+          +'<div class="tl-status">'+esc(te.status)+'</div>'
+          +'<div class="tl-body">'+esc(te.body)+'</div>';
+        tl.appendChild(entry);
+      }
+    }
+    // Links
+    var links=document.createElement('div');
+    links.className='inc-links';
+    if(inc.postmortem)links.innerHTML+='<a href="'+esc(inc.postmortem)+'" target="_blank" rel="noopener">Postmortem</a>';
+    links.innerHTML+='<a href="'+esc(inc.url)+'" target="_blank" rel="noopener">View on GitHub &rarr;</a>';
+    tl.appendChild(links);
+    el.appendChild(tl);
+
+    return el;
+  }
+
+  function render(data){
+    D=data;
+    var t=new Date(data.last_updated),age=Date.now()-t.getTime();
+    document.getElementById('ts').textContent=ago(t);
+    var st=document.getElementById('stale');
+    if(age>S){st.textContent='Data is '+Math.floor(age/6e4)+' minutes old. Monitoring may be unreachable.';st.style.display='block'}else st.style.display='none';
+
+    var gs={};
+    for(var gn in data.groups){var a=data.groups[gn].filter(function(m){return m.status!=='paused'});if(a.length)gs[gn]=a}
+
+    var tu=0,td=0;
+    for(var g in gs)for(var i=0;i<gs[g].length;i++)gs[g][i].status==='up'?tu++:td++;
+
+    // Incidents
+    var inc=data.incidents||{};inc.active=inc.active||[];inc.resolved=inc.resolved||[];
+    var incEl=document.getElementById('incidents');
+    incEl.innerHTML='';
+
+    // Hero — incidents take priority
+    var h=document.getElementById('hero');
+    if(inc.active.length>0){
+      var maxSev=3;
+      for(var si=0;si<inc.active.length;si++){
+        var s=inc.active[si].severity==='sev1'?1:inc.active[si].severity==='sev2'?2:3;
+        if(s<maxSev)maxSev=s;
+      }
+      if(maxSev===1){h.className='hero hero-down';h.innerHTML='<div class="hero-dot"></div>Active Incident \u2014 SEV1'}
+      else{h.className='hero hero-warn';h.innerHTML='<div class="hero-dot"></div>'+inc.active.length+' Active Incident'+(inc.active.length>1?'s':'')}
+    }else if(!td){h.className='hero hero-ok';h.innerHTML='<div class="hero-dot"></div>All Systems Operational'}
+    else if(td<=3){h.className='hero hero-warn';h.innerHTML='<div class="hero-dot"></div>'+td+' service'+(td>1?'s':'')+' experiencing issues'}
+    else{h.className='hero hero-down';h.innerHTML='<div class="hero-dot"></div>'+td+' services down'}
+
+    // Render active incidents
+    if(inc.active.length>0){
+      var ah=document.createElement('div');
+      ah.className='inc-header';
+      ah.innerHTML='Active Incidents <span class="cnt">'+inc.active.length+'</span>';
+      incEl.appendChild(ah);
+      for(var ai=0;ai<inc.active.length;ai++)incEl.appendChild(buildIncident(inc.active[ai],false));
+    }
+
+    // Render user reports
+    var reports=inc.user_reports||[];
+    if(reports.length>0){
+      var urh=document.createElement('div');
+      urh.className='inc-header';
+      urh.innerHTML='User Reports <span class="cnt">'+reports.length+'</span>';
+      incEl.appendChild(urh);
+      for(var ui=0;ui<reports.length;ui++)incEl.appendChild(buildIncident(reports[ui],false));
+    }
+
+    // Render resolved incidents
+    if(inc.resolved.length>0){
+      var rh=document.createElement('div');
+      rh.className='inc-header resolved-header';
+      rh.innerHTML='Recently Resolved <span class="cnt">last 7 days</span>';
+      incEl.appendChild(rh);
+      for(var ri=0;ri<inc.resolved.length;ri++)incEl.appendChild(buildIncident(inc.resolved[ri],true));
+    }
+
+    // Monitor groups
+    document.getElementById('bar').style.display='flex';
+    var c=document.getElementById('gs');c.innerHTML='';
+    var ks=Object.keys(gs).sort(function(a,b){return gs[b].length-gs[a].length});
+
+    for(var ki=0;ki<ks.length;ki++){
+      var gn=ks[ki],ms=gs[gn],so=srt(ms),vc=so.filter(fm).length;
+      var ge=document.createElement('div');ge.className='g'+(vc?'':' hide');
+
+      var up=ms.filter(function(m){return m.status==='up'}).length,dn=ms.length-up;
+      var hd=document.createElement('div');hd.className='gh';
+      hd.innerHTML='<div class="gt"><span class="chev">&#9656;</span>'+esc(gn)+' <span class="n">'+ms.length+'</span></div><div class="gs">'+(dn?'<span style="color:var(--red)">'+dn+' down</span>':'')+'<span style="color:var(--green)">'+up+' up</span></div>';
+      hd.onclick=function(){this.parentElement.classList.toggle('shut')};
+
+      var bd=document.createElement('div');bd.className='gb';
+      var ch=document.createElement('div');ch.className='colh';
+      ch.innerHTML='<div class="colh-sp"></div><div class="colh-n">Service</div><div class="colh-v"><div class="colh-l">24h</div><div class="colh-l">7d</div><div class="colh-l">30d</div></div>';
+      bd.appendChild(ch);
+
+      for(var mi=0;mi<so.length;mi++){
+        var m=so[mi],dc=m.status==='up'?'d-up':m.status==='pending'?'d-pn':'d-dn';
+        var r=document.createElement('div');r.className='row'+(fm(m)?'':' hide');
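+        // Monitor names and URLs come from status.json; escape them before building HTML.
+        var nameHtml=esc(m.name);
+        if(m.url){nameHtml='<a href="'+esc(m.url).replace(/"/g,'&quot;')+'" target="_blank" rel="noopener">'+esc(m.name)+'</a>'}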
+        r.innerHTML='<div class="d '+dc+'"></div><div class="mn">'+nameHtml+'</div><div class="uv"><span class="'+uc(m.uptime_24h)+'">'+pf(m.uptime_24h)+'</span><span class="'+uc(m.uptime_7d)+'">'+pf(m.uptime_7d)+'</span><span class="'+uc(m.uptime_30d)+'">'+pf(m.uptime_30d)+'</span></div>';
+        bd.appendChild(r);
+      }
+      ge.appendChild(hd);ge.appendChild(bd);c.appendChild(ge);
+    }
+  }
+
+  document.addEventListener('click',function(e){
+    if(!e.target.classList.contains('fbtn'))return;
+    var bs=document.querySelectorAll('.fbtn');for(var i=0;i<bs.length;i++)bs[i].classList.remove('on');
+    e.target.classList.add('on');F=e.target.getAttribute('data-f');if(D)render(D);
+  });
+  document.getElementById('ss').onchange=function(){O=this.value;if(D)render(D)};
+
+  function load(){
+    fetch(U+'?t='+Date.now()).then(function(r){if(!r.ok)throw 0;return r.json()}).then(render)
+    .catch(function(){document.getElementById('gs').innerHTML='<div class="err">Could not load status data.</div>';
+      var h=document.getElementById('hero');h.className='hero hero-down';h.innerHTML='<div class="hero-dot"></div>Status Unavailable'});
+  }
+  load();setInterval(load,6e4);
+})();
+</script>
+</body>
+</html>
--- a/stacks/status-page/terragrunt.hcl
+++ b/stacks/status-page/terragrunt.hcl
@ -0,0 +1,8 @@
+include "root" {
+  path = find_in_parent_folders()
+}
+
+dependency "infra" {
+  config_path  = "../infra"
+  skip_outputs = true
+}