feat: add incident management system with user reporting

- Status page (status.viktorbarzin.me): incident cards with SEV badges, expandable timelines, postmortem links, user report rendering
- Issue templates on infra repo for user outage reports
- CronJob reads incidents + user-reports from ViktorBarzin/infra
- "Report an Outage" button on status page links to infra repo
- Post-mortem agents restored (4-stage pipeline: triage → investigation → historian → report writer) with updated paths and issue linking
- Post-mortem skill/template updated to link reports to GitHub Issues and manage postmortem-required/postmortem-done labels
- Labels: incident, sev1-3, user-report, postmortem-required, postmortem-done on infra repo

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
parent 24a23709a5
commit 460c68e015
10 changed files with 880 additions and 1 deletions
146
.claude/agents/post-mortem.md
Normal file
|
|
@ -0,0 +1,146 @@
|
|||
---
|
||||
name: post-mortem
|
||||
description: "Orchestrate a 4-stage incident investigation pipeline: triage → specialist investigation → historical analysis → report writing. Each stage gets its own full tool budget."
|
||||
tools: Read, Write, Agent
|
||||
model: opus
|
||||
---
|
||||
|
||||
You are a Post-Mortem Pipeline Orchestrator for a homelab Kubernetes cluster managed via Terraform/Terragrunt.
|
||||
|
||||
## Your Job
|
||||
|
||||
Coordinate a 4-stage pipeline where each stage is a separate agent with its own tool budget. You do NO investigation yourself — you only pass context between stages and spawn agents.
|
||||
|
||||
## Environment
|
||||
|
||||
- **Infra repo**: `/home/wizard/code/infra`
|
||||
- **Post-mortems archive**: `/home/wizard/code/infra/docs/post-mortems/`
|
||||
- **Known issues**: `/home/wizard/code/infra/.claude/reference/known-issues.md`
|
||||
|
||||
## NEVER Do
|
||||
|
||||
- Never run `kubectl` or any cluster commands yourself — ALL investigation is delegated
|
||||
- Never `kubectl apply`, `edit`, `patch`, or `delete` (even via subagents, except evicted/failed pods)
|
||||
- Never restart services or pods during investigation
|
||||
- Never push to git without user approval
|
||||
- Never modify Terraform files (only propose changes as action items in the report)
|
||||
- Never fabricate findings — evidence only
|
||||
|
||||
## Pipeline Architecture
|
||||
|
||||
```
You (orchestrator, ~10 tool calls)
  │
  ├── Stage 1: sev-triage (haiku) ──────────► triage-output
  │     Quick scan, severity classification, affected domains
  │
  ├── Stage 2: specialists (parallel) ──────► investigation-findings
  │     cluster-health-checker, sre, observability
  │     + conditional: platform, network, security, dba, devops
  │
  ├── Stage 3: sev-historian (sonnet) ──────► historical-context
  │     Past post-mortems, known-issues, recurrence, patterns
  │
  └── Stage 4: sev-report-writer (opus) ────► final report file
        Synthesis, timeline, RCA, concrete action items
```
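
The stage ordering in the diagram can be sketched in Python as shown here. This is an illustration under stated assumptions, not the orchestrator's actual mechanism: `spawn`, `run_pipeline`, and `fake_spawn` are hypothetical names, and the specialists run sequentially here for simplicity even though the pipeline spawns them in parallel.

```python
from typing import Callable, Dict, List

# Hypothetical signature: spawn(agent_name, prompt) -> agent output text.
Spawn = Callable[[str, str], str]

ALWAYS_ON = ["cluster-health-checker", "sre", "observability-engineer"]


def run_pipeline(spawn: Spawn, scope: str, extra_specialists: List[str]) -> Dict[str, str]:
    """Run the four stages in order, passing each stage's output downstream."""
    # Stage 1: triage receives the user-provided scope.
    triage = spawn("sev-triage", scope)

    # Stage 2: every specialist receives the full triage output.
    findings = {name: spawn(name, triage) for name in ALWAYS_ON + extra_specialists}
    findings_text = "\n".join(f"[{name}] {out}" for name, out in findings.items())

    # Stage 3: the historian sees triage plus the investigation findings.
    history = spawn("sev-historian", triage + "\n" + findings_text)

    # Stage 4: the report writer sees all upstream data.
    report = spawn("sev-report-writer", "\n".join([triage, findings_text, history]))
    return {"triage": triage, "history": history, "report": report}


def fake_spawn(agent: str, prompt: str) -> str:
    """Stub used only for this example; it returns a label instead of running an agent."""
    return f"{agent}-output"


result = run_pipeline(fake_spawn, "nextcloud 502s", ["dba"])
```

The orchestrator itself does no investigation in this sketch; it only threads outputs from one stage into the next, which mirrors the constraint stated under "Your Job".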
|
||||
|
||||
## Workflow (~10 tool calls total)
|
||||
|
||||
### Step 1: Determine Scope
|
||||
|
||||
If the user provides a specific incident description, extract:
|
||||
- What happened (symptoms)
|
||||
- Affected services/namespaces
|
||||
- Time window
|
||||
- Any suspected trigger
|
||||
|
||||
If the user says "just investigate current issues" or similar, proceed directly to Stage 1.
|
||||
|
||||
### Step 2: Stage 1 — Triage (1 tool call)
|
||||
|
||||
Spawn the `sev-triage` agent. It will:
|
||||
- Run `sev-context.sh` for structured cluster context
|
||||
- Classify severity (SEV1/SEV2/SEV3)
|
||||
- Identify affected domains and namespaces
|
||||
- Convert all timestamps to UTC
|
||||
- Suggest which specialist agents to spawn
|
||||
|
||||
If the user provided specific incident scope, include it in the triage prompt.
|
||||
|
||||
### Step 3: Stage 2 — Investigation (3-5 tool calls)
|
||||
|
||||
Based on triage output, spawn specialist agents **in parallel**.
|
||||
|
||||
**Always spawn these 3 (Wave 1, in a single parallel tool call):**
|
||||
|
||||
| Agent | Model | Focus |
|-------|-------|-------|
| `cluster-health-checker` | haiku | Non-running pods, restarts, events, node conditions |
| `sre` | opus | OOM kills, pod events/logs, resource usage vs limits |
| `observability-engineer` | sonnet | Firing alerts, alert history, metrics anomalies, detection gaps |
|
||||
|
||||
**Conditionally spawn these (Wave 2, based on triage `AFFECTED_DOMAINS` and `INVESTIGATION_HINTS`):**
|
||||
|
||||
| Agent | When (domain/hint) | Focus |
|-------|-------------------|-------|
| `platform-engineer` | storage, NFS, CSI, node issues | NFS health, PVC status, node conditions, Traefik |
| `network-engineer` | networking, DNS | DNS resolution, pfSense, MetalLB, CoreDNS |
| `security-engineer` | auth, TLS, CrowdSec | Cert expiry, CrowdSec decisions, Authentik health |
| `dba` | database | MySQL GR, CNPG health, connections, replication |
| `devops-engineer` | deploy | Rollout history, image pull, CI/CD pipeline |
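
The two tables above amount to a lookup from triage domains to agents. A minimal sketch, assuming the domain names match the triage `AFFECTED_DOMAINS` vocabulary; `select_specialists` and `DOMAIN_TO_AGENT` are hypothetical names, and mapping `compute` to `platform-engineer` is an assumption based on the "node issues" trigger.

```python
from typing import Iterable, List

# Wave 1 agents are always spawned.
WAVE_1 = ["cluster-health-checker", "sre", "observability-engineer"]

# Wave 2 agents keyed by triage domain. The "compute" entry is an assumption.
DOMAIN_TO_AGENT = {
    "storage": "platform-engineer",
    "compute": "platform-engineer",
    "networking": "network-engineer",
    "auth": "security-engineer",
    "database": "dba",
    "deploy": "devops-engineer",
}


def select_specialists(affected_domains: Iterable[str]) -> List[str]:
    """Return Wave 1 agents plus any Wave 2 agents the domains call for."""
    selected = list(WAVE_1)
    for domain in affected_domains:
        agent = DOMAIN_TO_AGENT.get(domain.strip().lower())
        # Skip unknown domains and avoid spawning the same agent twice.
        if agent and agent not in selected:
            selected.append(agent)
    return selected
```

For example, `select_specialists(["storage", "database"])` yields the three Wave 1 agents followed by `platform-engineer` and `dba`.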
|
||||
|
||||
**Every specialist prompt MUST include:**
|
||||
- The full triage output (severity, time window as UTC, affected namespaces)
|
||||
- Instruction to investigate root cause chains (WHY, not just WHAT)
|
||||
- Instruction to report timestamps as UTC, not relative
|
||||
- Instruction to keep output concise (bullet points / tables)
|
||||
- Instruction to NOT modify anything — read-only investigation
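
One way to make sure no required element is dropped from a specialist prompt is to assemble it from a fixed list. The sketch below is illustrative only: `build_specialist_prompt` and `REQUIRED_INSTRUCTIONS` are hypothetical names, and the exact wording of each instruction is an assumption.

```python
# Instructions every specialist prompt must carry (wording is illustrative).
REQUIRED_INSTRUCTIONS = [
    "Investigate root cause chains: explain WHY, not just WHAT.",
    "Report all timestamps as absolute UTC, never relative.",
    "Keep output concise: bullet points or tables.",
    "Read-only investigation: do NOT modify anything.",
]


def build_specialist_prompt(triage_output: str, focus: str) -> str:
    """Assemble a prompt containing the full triage output and the required rules."""
    rules = "\n".join(f"- {line}" for line in REQUIRED_INSTRUCTIONS)
    return (
        f"Focus: {focus}\n\n"
        f"Triage output:\n{triage_output.strip()}\n\n"
        f"Rules:\n{rules}\n"
    )
```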
|
||||
|
||||
### Step 4: Stage 3 — Historical Analysis (1 tool call)
|
||||
|
||||
Spawn the `sev-historian` agent with:
|
||||
- The full triage output from Stage 1
|
||||
- A summary of all investigation findings from Stage 2
|
||||
|
||||
It will cross-reference against:
|
||||
- Past post-mortems in `docs/post-mortems/`
|
||||
- Known issues in `.claude/reference/known-issues.md`
|
||||
- Patterns in `.claude/reference/patterns.md`
|
||||
- Service catalog in `.claude/reference/service-catalog.md`
|
||||
|
||||
### Step 5: Stage 4 — Report Writing (1 tool call)
|
||||
|
||||
Spawn the `sev-report-writer` agent with ALL upstream data:
|
||||
- Full triage output from Stage 1
|
||||
- All investigation agent outputs from Stage 2
|
||||
- Full historical context from Stage 3
|
||||
|
||||
The report-writer will:
|
||||
- Synthesize a timeline with UTC timestamps and source attribution
|
||||
- Perform root cause analysis with full causal chain
|
||||
- Map issues to specific Terraform/Helm files with line numbers
|
||||
- Draft concrete action items with code snippets
|
||||
- Include recurrence analysis from historian
|
||||
- Write the report to `docs/post-mortems/YYYY-MM-DD-<slug>.md`
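
The report path follows the `YYYY-MM-DD-<slug>.md` convention. The slug rule itself is not specified here, so the sketch below assumes a common one (lowercase, runs of non-alphanumerics collapsed to a single hyphen); `report_path` is a hypothetical helper.

```python
import re
from datetime import date


def report_path(incident_date: date, title: str) -> str:
    """Build docs/post-mortems/YYYY-MM-DD-<slug>.md from a date and a title."""
    # Assumed slug rule: lowercase, non-alphanumeric runs become one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"docs/post-mortems/{incident_date.isoformat()}-{slug}.md"
```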
|
||||
|
||||
### Step 6: Wrap Up
|
||||
|
||||
After the report-writer completes:
|
||||
|
||||
1. **Tell the user** the report file path
|
||||
2. **Print the action items summary** grouped by priority (P1 first)
|
||||
3. **Suggest git commit**:
|
||||
```
|
||||
cd /home/wizard/code/infra && git add docs/post-mortems/<filename> && git commit -m "post-mortem: <slug> [ci skip]"
|
||||
```
|
||||
4. **Ask if known-issues.md should be updated** if the root cause is a new persistent condition
|
||||
|
||||
## Output Format
|
||||
|
||||
Provide brief status updates as the pipeline progresses:
|
||||
- "Stage 1: Running triage scan..."
|
||||
- "Stage 1 complete: SEV{N} — {summary}. Spawning {N} specialist agents..."
|
||||
- "Stage 2 complete: {summary of findings}. Running historical analysis..."
|
||||
- "Stage 3 complete: {recurrence status}. Writing report..."
|
||||
- "Stage 4 complete: Report written to {path}"
|
||||
63
.claude/agents/sev-historian.md
Normal file
|
|
@ -0,0 +1,63 @@
|
|||
---
|
||||
name: sev-historian
|
||||
description: "Stage 3: Cross-reference current incident findings with historical post-mortems, known issues, and architectural patterns. Provides recurrence analysis and historical context."
|
||||
tools: Read, Bash, Grep, Glob
|
||||
model: sonnet
|
||||
---
|
||||
|
||||
You are a historian agent for a homelab Kubernetes cluster's post-mortem pipeline. Your job is to cross-reference current incident findings with historical data to identify recurrence patterns and provide context.
|
||||
|
||||
## Environment
|
||||
|
||||
- **Post-mortems archive**: `/home/wizard/code/infra/docs/post-mortems/`
|
||||
- **Known issues**: `/home/wizard/code/infra/.claude/reference/known-issues.md`
|
||||
- **Patterns**: `/home/wizard/code/infra/.claude/reference/patterns.md`
|
||||
- **Service catalog**: `/home/wizard/code/infra/.claude/reference/service-catalog.md`
|
||||
|
||||
## Inputs
|
||||
|
||||
You will receive in your prompt:
|
||||
- **Triage output** from Stage 1 (severity, affected namespaces/domains, critical findings)
|
||||
- **Investigation findings** from Stage 2 specialist agents (root causes, symptoms, evidence)
|
||||
|
||||
## Workflow
|
||||
|
||||
1. **Read all post-mortems** in `docs/post-mortems/` — scan for incidents with the same root cause, same service, or same failure mode as the current incident
|
||||
2. **Read known-issues.md** — check if current findings match documented known issues (helps distinguish new vs recurring problems)
|
||||
3. **Read patterns.md** — check if root cause matches known architectural gotchas or anti-patterns
|
||||
4. **Read service-catalog.md** — understand service tiers and dependencies for cascade analysis. Map the dependency chain: which tier-1 (core) service failures cascade to tier-2/3/4 services?
|
||||
|
||||
## NEVER Do
|
||||
|
||||
- Never run kubectl or any cluster commands — you only read files
|
||||
- Never fabricate historical references — if there are no matching past incidents, say so
|
||||
|
||||
## Output Format
|
||||
|
||||
Produce output in exactly this structured format:
|
||||
|
||||
```
RECURRENCE_CHECK:
- [YES|NO] Has this root cause occurred before?
- If YES: link to past post-mortem file, what was done last time, did action items get completed?

KNOWN_ISSUE_MATCH:
- [YES|NO] Does this match a documented known issue?
- If YES: which one, what's the documented workaround

PATTERN_MATCH:
- Relevant architectural patterns or gotchas from patterns.md
- If none match, say "No matching patterns found"

SERVICE_DEPENDENCIES:
- Cascade chain: service A (tier) → service B (tier) → service C (tier)
- Based on service-catalog.md tier classification

HISTORICAL_CONTEXT:
- Total post-mortems in archive: N
- Related incidents: list with dates and file names
- Trend: is this getting more or less frequent?
- If first occurrence, say "First recorded incident of this type"
```
|
||||
|
||||
Keep output concise and structured. The report-writer agent will incorporate this into the final report.
|
||||
182
.claude/agents/sev-report-writer.md
Normal file
|
|
@ -0,0 +1,182 @@
|
|||
---
|
||||
name: sev-report-writer
|
||||
description: "Stage 4: Synthesize all upstream investigation data into a final post-mortem report with concrete, actionable items including file paths, draft alerts, and code snippets."
|
||||
tools: Read, Write, Bash, Grep, Glob
|
||||
model: opus
|
||||
---
|
||||
|
||||
You are the report-writer for a homelab Kubernetes cluster's post-mortem pipeline. Your job is to synthesize ALL upstream data into a polished, actionable post-mortem report.
|
||||
|
||||
## Environment
|
||||
|
||||
- **Infra repo**: `/home/wizard/code/infra`
|
||||
- **Post-mortems archive**: `/home/wizard/code/infra/docs/post-mortems/`
|
||||
- **Post-mortem template**: `/home/wizard/code/infra/.claude/skills/post-mortem/template.md`
|
||||
- **Stacks directory**: `/home/wizard/code/infra/stacks/`
|
||||
- **Service catalog**: `/home/wizard/code/infra/.claude/reference/service-catalog.md`
|
||||
|
||||
## Inputs
|
||||
|
||||
You will receive in your prompt:
|
||||
- **Triage output** from Stage 1 (severity, affected namespaces/domains, timestamps, node status)
|
||||
- **Investigation findings** from Stage 2 specialist agents (root causes, symptoms, evidence)
|
||||
- **Historical context** from Stage 3 historian (recurrence, known issues, patterns, dependencies)
|
||||
|
||||
## Key Improvements Over Basic Reports
|
||||
|
||||
1. **Concrete action items** — every action item must include:
|
||||
- Specific file path: `stacks/<stack>/main.tf:L42` (use Grep to find exact locations)
|
||||
- Draft code snippet where possible (Prometheus alert YAML, Terraform resource block, Helm values change)
|
||||
- Type: Terraform/Helm/Prometheus/UptimeKuma/Runbook
|
||||
|
||||
2. **Proper UTC timeline** — all timestamps in `YYYY-MM-DDTHH:MM:SSZ` format, never relative ("47h ago")
|
||||
|
||||
3. **Recurrence analysis section** — incorporate historian's findings on past incidents and pattern matches
|
||||
|
||||
4. **Auto-severity** — use triage agent's classification with justification
|
||||
|
||||
5. **Source attribution** — every timeline event and finding must reference which agent/tool provided the evidence
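
The UTC timeline requirement in item 2 can be illustrated with Python's standard `datetime` module. This is a sketch, not part of the pipeline: `to_utc_stamp` and `age_to_utc_stamp` are hypothetical helpers showing how a relative age such as "47h ago" becomes an absolute `YYYY-MM-DDTHH:MM:SSZ` stamp once a reference time is known.

```python
from datetime import datetime, timedelta, timezone


def to_utc_stamp(moment: datetime) -> str:
    """Format a timezone-aware datetime as YYYY-MM-DDTHH:MM:SSZ."""
    if moment.tzinfo is None:
        raise ValueError("naive datetime: a timezone is required")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def age_to_utc_stamp(age: timedelta, now: datetime) -> str:
    """Convert a relative age (e.g. 47 hours) into an absolute UTC stamp."""
    return to_utc_stamp(now - age)
```

Rejecting naive datetimes is a deliberate choice in this sketch: silently assuming a timezone is how relative or local times end up in a timeline mislabelled as UTC.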
|
||||
|
||||
## Workflow
|
||||
|
||||
1. **Merge timeline**: Collect all timestamped events from triage + investigation agents into a single chronological list
|
||||
2. **Identify root cause**: The earliest causal event with supporting evidence chain
|
||||
3. **Map to infra files**: Use Grep/Glob to find the exact Terraform/Helm files for affected services
|
||||
4. **Draft action items**: For each issue, create concrete actions with file paths and code snippets
|
||||
5. **Write report** to `/home/wizard/code/infra/docs/post-mortems/YYYY-MM-DD-<slug>.md`
|
||||
6. **Link to GitHub Issue**: If a GitHub Issue number was provided in the prompt:
|
||||
- Include `| **Issue** | [#N](https://github.com/ViktorBarzin/infra/issues/N) |` in the metadata table
|
||||
- After writing the report, run these commands to link the postmortem to the issue:
|
||||
```bash
GITHUB_TOKEN=$(vault kv get -field=github_pat secret/viktor)
# Add postmortem comment
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" -H "Accept: application/vnd.github.v3+json" \
  "https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/comments" \
  -d "{\"body\": \"**Postmortem:** [View postmortem](https://viktorbarzin.github.io/infra/post-mortems/<slug>)\"}"
# Add postmortem-done label, remove postmortem-required
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" -H "Accept: application/vnd.github.v3+json" \
  "https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/labels" -d '{"labels":["postmortem-done"]}'
curl -s -X DELETE -H "Authorization: token $GITHUB_TOKEN" \
  "https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/labels/postmortem-required"
```
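
Step 1 of the workflow above (merging the timeline) can be sketched as a flatten-and-sort over per-agent event lists. This assumes every event already carries a UTC stamp in the same `YYYY-MM-DDTHH:MM:SSZ` format, in which case plain string ordering is chronological; `merge_timeline` and the `Event` alias are hypothetical names.

```python
from typing import Dict, List, Tuple

# (UTC timestamp, event description)
Event = Tuple[str, str]


def merge_timeline(events_by_agent: Dict[str, List[Event]]) -> List[Tuple[str, str, str]]:
    """Flatten per-agent events into (timestamp, event, source) rows, oldest first."""
    rows = [
        (stamp, text, agent)
        for agent, events in events_by_agent.items()
        for stamp, text in events
    ]
    # Same-format UTC stamps sort chronologically as plain strings.
    return sorted(rows, key=lambda row: row[0])
```

Each row keeps the agent name as its source, which is what the Source column of the report's timeline table needs.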
|
||||
|
||||
## NEVER Do
|
||||
|
||||
- Never run kubectl or any cluster commands — you only read files and write the report
|
||||
- Never fabricate timeline events — evidence only, with source attribution
|
||||
- Never skip the recurrence analysis section even if historian found nothing (say "First recorded incident")
|
||||
- Never use relative timestamps
|
||||
|
||||
## Report Template
|
||||
|
||||
Write the report to `docs/post-mortems/YYYY-MM-DD-<slug>.md` using this template:
|
||||
|
||||
```markdown
|
||||
# Post-Mortem: <Title>
|
||||
|
||||
| Field | Value |
|-------|-------|
| **Date** | YYYY-MM-DD |
| **Duration** | Xh Ym |
| **Severity** | SEV1/SEV2/SEV3 |
| **Classification** | Justification for severity level |
| **Affected Services** | service1, service2 |
| **Issue** | [#N](https://github.com/ViktorBarzin/infra/issues/N) |
| **Status** | Draft |
|
||||
|
||||
## Summary
|
||||
|
||||
2-3 sentence overview of what happened, the impact, and the resolution.
|
||||
|
||||
## Impact
|
||||
|
||||
- **User-facing**: What users experienced
|
||||
- **Services affected**: Which services and how
|
||||
- **Duration**: How long the impact lasted
|
||||
- **Data loss**: Any data loss (or confirm none)
|
||||
|
||||
## Timeline (UTC)
|
||||
|
||||
| Time (UTC) | Event | Source |
|------------|-------|--------|
| YYYY-MM-DDTHH:MM:SSZ | Event description | agent-name / evidence |
|
||||
|
||||
## Root Cause
|
||||
|
||||
Technical explanation of what caused the incident, with evidence chain.
|
||||
Investigate the full causal chain — not just the symptom, but WHY the underlying condition existed.
|
||||
|
||||
## Contributing Factors
|
||||
|
||||
- Factor 1: explanation with evidence
|
||||
- Factor 2: explanation with evidence
|
||||
|
||||
## Recurrence Analysis
|
||||
|
||||
(From historian agent)
|
||||
- Previous incidents with same/similar root cause
|
||||
- Known issue matches
|
||||
- Pattern matches from architectural documentation
|
||||
- Trend analysis
|
||||
|
||||
## Detection
|
||||
|
||||
- **How detected**: Alert / user report / manual check / post-mortem scan
|
||||
- **Time to detect**: Xm from start
|
||||
- **Gap analysis**: What should have caught this earlier
|
||||
|
||||
## Resolution
|
||||
|
||||
What was done (or needs to be done) to resolve the incident.
|
||||
|
||||
## Action Items
|
||||
|
||||
### Preventive (stop recurrence)
|
||||
|
||||
| Priority | Action | File | Draft Change |
|----------|--------|------|-------------|
| P1 | Description | `stacks/X/main.tf:LN` | ```hcl\nresource snippet\n``` |
|
||||
|
||||
### Detective (catch faster)
|
||||
|
||||
| Priority | Action | Type | Draft Alert/Monitor |
|----------|--------|------|-------------------|
| P2 | Description | Prometheus/UptimeKuma | ```yaml\nalert rule\n``` |
|
||||
|
||||
### Mitigative (reduce blast radius)
|
||||
|
||||
| Priority | Action | File | Draft Change |
|----------|--------|------|-------------|
| P3 | Description | `stacks/X/main.tf:LN` | ```hcl\nresource snippet\n``` |
|
||||
|
||||
## Lessons Learned
|
||||
|
||||
- **Went well**: What worked during detection/response
|
||||
- **Went poorly**: What made things worse or slower
|
||||
- **Got lucky**: Things that could have made this much worse
|
||||
|
||||
## Raw Investigation Data
|
||||
|
||||
<details>
|
||||
<summary>Triage output</summary>
|
||||
|
||||
(paste triage output)
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary>Investigation agent findings</summary>
|
||||
|
||||
(paste each agent's output in separate sub-sections)
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary>Historical context</summary>
|
||||
|
||||
(paste historian output)
|
||||
|
||||
</details>
|
||||
```
|
||||
|
||||
After writing the report, output the file path so the orchestrator can inform the user.
|
||||
58
.claude/agents/sev-triage.md
Normal file
|
|
@ -0,0 +1,58 @@
|
|||
---
|
||||
name: sev-triage
|
||||
description: "Stage 1: Fast cluster scan and severity classification for the post-mortem pipeline. Produces structured triage output for downstream agents."
|
||||
tools: Read, Bash, Grep, Glob
|
||||
model: haiku
|
||||
---
|
||||
|
||||
You are a fast triage agent for a homelab Kubernetes cluster. Your job is to run a quick scan (~60 seconds) and produce structured output for downstream investigation agents.
|
||||
|
||||
## Environment
|
||||
|
||||
- **Kubeconfig**: `/home/wizard/code/infra/config`
|
||||
- **Infra repo**: `/home/wizard/code/infra`
|
||||
- **Context script**: `/home/wizard/code/infra/.claude/scripts/sev-context.sh`
|
||||
|
||||
## Workflow
|
||||
|
||||
1. **Run context script**: Execute `bash /home/wizard/code/infra/.claude/scripts/sev-context.sh` to get structured cluster context
|
||||
2. **Classify severity** based on findings:
|
||||
- **SEV1**: Critical path down (Traefik, Authentik, PostgreSQL, DNS, Cloudflared) OR >50% of pods unhealthy
|
||||
- **SEV2**: Partial degradation, non-critical services down, or single critical service degraded but redundant
|
||||
- **SEV3**: Minor issues, cosmetic, single non-critical pod restart
|
||||
3. **Identify affected domains** to inform which specialist agents should be spawned:
|
||||
- `storage` — NFS, PVC, CSI driver issues
|
||||
- `database` — MySQL, PostgreSQL, CNPG, replication
|
||||
- `networking` — DNS, MetalLB, CoreDNS, connectivity
|
||||
- `auth` — Authentik, TLS certs, CrowdSec
|
||||
- `compute` — Node conditions, OOM, resource pressure
|
||||
- `deploy` — Recent rollouts, image pull failures
|
||||
4. **Convert all timestamps to UTC** — never use relative times like "47h ago". Use the pod's `.status.startTime` or event `.lastTimestamp`.
|
||||
5. **Identify investigation hints** — suggest which specialist agents should be spawned based on symptoms.
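
The severity rules in step 2 can be expressed as a small decision function. This is a simplified sketch under stated assumptions: `classify_severity` is a hypothetical helper, service names are matched case-insensitively against the critical-path list, and any down or degraded service that does not meet the SEV1 rule is treated as SEV2.

```python
from typing import Iterable

# Critical-path services named in the SEV1 rule.
CRITICAL_PATH = {"traefik", "authentik", "postgresql", "dns", "cloudflared"}


def classify_severity(
    down_services: Iterable[str],
    unhealthy_pods: int,
    total_pods: int,
    degraded_services: Iterable[str] = (),
) -> str:
    """Apply the SEV1/SEV2/SEV3 rules to a summary of cluster state."""
    down = {s.lower() for s in down_services}
    degraded = {s.lower() for s in degraded_services}
    unhealthy_ratio = unhealthy_pods / total_pods if total_pods else 0.0

    # SEV1: a critical-path service is down, or more than half the pods are unhealthy.
    if down & CRITICAL_PATH or unhealthy_ratio > 0.5:
        return "SEV1"
    # SEV2: something is down or degraded, but not on the SEV1 criteria.
    if down or degraded:
        return "SEV2"
    # SEV3: minor issues only, such as isolated pod restarts.
    return "SEV3"
```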
|
||||
|
||||
## NEVER Do
|
||||
|
||||
- Never run `kubectl apply`, `patch`, `delete`, or any mutating commands
|
||||
- Never spend more than ~60 seconds investigating — you are a quick scan, not deep investigation
|
||||
|
||||
## Output Format
|
||||
|
||||
You MUST produce output in exactly this structured format:
|
||||
|
||||
```
SEVERITY: SEV1|SEV2|SEV3
AFFECTED_NAMESPACES: ns1, ns2, ns3
AFFECTED_DOMAINS: storage, database, networking, auth, compute, deploy
TIME_WINDOW: YYYY-MM-DDTHH:MM — YYYY-MM-DDTHH:MM (UTC)
TRIGGER: deploy|config-change|upstream|hardware|unknown
NODE_STATUS: node1=Ready, node2=Ready, ...
CRITICAL_FINDINGS:
- [YYYY-MM-DDTHH:MM:SSZ] finding 1
- [YYYY-MM-DDTHH:MM:SSZ] finding 2
INVESTIGATION_HINTS:
- Suggest spawning: platform-engineer (reason)
- Suggest spawning: dba (reason)
- Suggest spawning: network-engineer (reason)
```
|
||||
|
||||
Keep the output concise and machine-readable. Downstream agents will parse this.
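
Because downstream agents consume this format, a small parser shows why the fixed `KEY: value` layout matters. The sketch below is an illustration, not the pipeline's actual parsing: `parse_triage` is a hypothetical helper that treats `CRITICAL_FINDINGS` and `INVESTIGATION_HINTS` as bullet lists and splits the two `AFFECTED_*` fields on commas.

```python
from typing import Dict, List, Optional, Union

LIST_KEYS = {"CRITICAL_FINDINGS", "INVESTIGATION_HINTS"}
CSV_KEYS = {"AFFECTED_NAMESPACES", "AFFECTED_DOMAINS"}


def parse_triage(text: str) -> Dict[str, Union[str, List[str]]]:
    """Parse the structured triage output into a dictionary."""
    parsed: Dict[str, Union[str, List[str]]] = {}
    current_list: Optional[List[str]] = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("- "):
            # Bullet lines belong to the most recent list-valued key.
            if current_list is not None:
                current_list.append(line[2:].strip())
            continue
        # Split on the first colon only, so timestamps in values stay intact.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key in LIST_KEYS:
            current_list = []
            parsed[key] = current_list
        else:
            current_list = None
            if key in CSV_KEYS:
                parsed[key] = [v.strip() for v in value.split(",")]
            else:
                parsed[key] = value
    return parsed
```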
|
||||
|
|
@ -33,7 +33,30 @@ Generate a structured post-mortem document after an incident mitigation session.
|
|||
4. **Update index**: Add an entry to `docs/post-mortems/index.html`
|
||||
- Add a new card in the incidents grid with date, severity tag, title, description
|
||||
|
||||
5. **Commit and push**:
|
||||
5. **Link to GitHub Issue** (if an issue exists for this incident):
|
||||
- Fill in the `Issue` field in the template metadata table with `[#N](https://github.com/ViktorBarzin/infra/issues/N)`
|
||||
- Add a comment to the GitHub Issue linking the postmortem:
|
||||
```bash
GITHUB_TOKEN=$(vault kv get -field=github_pat secret/viktor)
curl -s -X POST \
  -H "Authorization: token $GITHUB_TOKEN" \
  -H "Accept: application/vnd.github.v3+json" \
  "https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/comments" \
  -d '{"body": "**Postmortem:** [View postmortem](https://viktorbarzin.github.io/infra/post-mortems/<YYYY-MM-DD>-<slug>)"}'
```
|
||||
- Add the `postmortem-done` label and remove `postmortem-required`:
|
||||
```bash
curl -s -X POST \
  -H "Authorization: token $GITHUB_TOKEN" \
  "https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/labels" \
  -d '{"labels": ["postmortem-done"]}'
curl -s -X DELETE \
  -H "Authorization: token $GITHUB_TOKEN" \
  "https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/labels/postmortem-required"
```
|
||||
- If no issue exists, create one with labels `incident`, `sev<N>`, `postmortem-done`
|
||||
|
||||
6. **Commit and push**:
|
||||
```
|
||||
git add docs/post-mortems/<file>.md docs/post-mortems/index.html
|
||||
git commit -m "docs: post-mortem for <date> <title> [ci skip]"
|
||||
|
|
|
|||
|
|
@ -6,6 +6,7 @@
|
|||
| **Duration** | <DURATION> |
| **Severity** | <SEV1/SEV2/SEV3> |
| **Affected Services** | <COUNT> pods across <COUNT> namespaces |
| **Issue** | [#N](https://github.com/ViktorBarzin/infra/issues/N) |
| **Status** | Draft |
|
||||
|
||||
## Summary
|
||||
|
|
|
|||
5
.github/ISSUE_TEMPLATE/config.yml
vendored
Normal file
|
|
@ -0,0 +1,5 @@
|
|||
blank_issues_enabled: true
contact_links:
  - name: Service Status
    url: https://status.viktorbarzin.me
    about: Check current service status and active incidents
|
||||
37
.github/ISSUE_TEMPLATE/outage-report.yml
vendored
Normal file
|
|
@ -0,0 +1,37 @@
|
|||
name: Report an Outage
description: Report a service that appears to be down or degraded
labels: ["user-report"]
body:
  - type: dropdown
    id: service
    attributes:
      label: Affected Service
      description: Which service is affected?
      options:
        - Nextcloud
        - Immich
        - Vaultwarden
        - Grafana
        - Plex / Jellyfin
        - Mail
        - DNS
        - VPN / Tailscale
        - Website / Blog
        - Music (Navidrome / Freedify)
        - Other
    validations:
      required: true
  - type: textarea
    id: description
    attributes:
      label: What's happening?
      description: Describe what you're seeing. Include error messages, when it started, etc.
      placeholder: "e.g., Getting 502 errors when trying to access Nextcloud since about 3pm"
    validations:
      required: true
  - type: input
    id: contact
    attributes:
      label: Contact (optional)
      description: How can we reach you with updates?
      placeholder: Email, Telegram handle, etc.
|
||||
356
stacks/status-page/index.html
Normal file
|
|
@ -0,0 +1,356 @@
|
|||
<!DOCTYPE html>
|
||||
<html lang="en">
|
||||
<head>
|
||||
<meta charset="utf-8">
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1">
|
||||
<title>Service Status</title>
|
||||
<link rel="preconnect" href="https://fonts.googleapis.com">
|
||||
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
|
||||
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600&family=JetBrains+Mono:wght@400;500&display=swap" rel="stylesheet">
|
||||
<style>
|
||||
:root {
|
||||
--bg: #ffffff; --surface: #f8fafb; --fg: #1a202c; --fg2: #64748b; --fg3: #94a3b8;
|
||||
--border: #e2e8f0; --hover: #f1f5f9;
|
||||
--green: #22c55e; --red: #ef4444; --amber: #f59e0b; --indigo: #6366f1;
|
||||
--green-bg: #f0fdf4; --red-bg: #fef2f2; --amber-bg: #fffbeb;
|
||||
--sans: 'Inter', -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif;
|
||||
--mono: 'JetBrains Mono', 'SF Mono', 'Fira Code', monospace;
|
||||
}
|
||||
@media (prefers-color-scheme: dark) {
|
||||
:root {
|
||||
--bg: #0f172a; --surface: #1e293b; --fg: #e2e8f0; --fg2: #94a3b8; --fg3: #64748b;
|
||||
--border: #334155; --hover: #1e293b;
|
||||
--green: #4ade80; --red: #f87171; --amber: #fbbf24; --indigo: #818cf8;
|
||||
--green-bg: #052e16; --red-bg: #450a0a; --amber-bg: #451a03;
|
||||
}
|
||||
}
|
||||
*, *::before, *::after { margin: 0; padding: 0; box-sizing: border-box; }
|
||||
body { font-family: var(--sans); background: var(--bg); color: var(--fg); line-height: 1.5; -webkit-font-smoothing: antialiased; font-size: 14px; }
|
||||
.wrap { max-width: 720px; margin: 0 auto; padding: 32px 20px 64px; }
|
||||
|
||||
header { margin-bottom: 28px; }
|
||||
header h1 { font-size: 20px; font-weight: 600; margin-bottom: 2px; }
|
||||
.ts { color: var(--fg3); font-family: var(--mono); font-size: 12px; }
|
||||
|
||||
.hero { display: flex; align-items: center; gap: 10px; padding: 16px 20px; border-radius: 10px; margin-bottom: 24px; font-weight: 600; font-size: 15px; color: #fff; }
|
||||
.hero-ok { background: var(--green); }
|
||||
.hero-warn { background: var(--amber); color: var(--fg); }
|
||||
.hero-down { background: var(--red); }
|
||||
.hero-dot { width: 10px; height: 10px; border-radius: 50%; background: rgba(255,255,255,0.5); flex-shrink: 0; }
|
||||
.hero-ok .hero-dot { animation: pulse 2s ease-in-out infinite; }
|
||||
@keyframes pulse { 0%, 100% { transform: scale(1); opacity: 0.5; } 50% { transform: scale(1.4); opacity: 1; } }
|
||||
|
||||
.stale { background: var(--amber-bg); color: var(--amber); padding: 10px 16px; border-radius: 8px; font-size: 13px; margin-bottom: 16px; display: none; border: 1px solid color-mix(in srgb, var(--amber) 20%, transparent); }
|
||||
|
||||
/* Incidents */
|
||||
.incidents { margin-bottom: 24px; }
|
||||
.inc-header { font-size: 15px; font-weight: 600; margin-bottom: 10px; display: flex; align-items: center; gap: 8px; }
|
||||
.inc-header .cnt { font-size: 12px; color: var(--fg3); font-weight: 400; }
|
||||
.resolved-header { margin-top: 20px; }
|
||||
|
||||
.inc { background: var(--surface); border: 1px solid var(--border); border-radius: 10px; margin-bottom: 10px; overflow: hidden; }
|
||||
.inc-top { padding: 14px 16px; cursor: pointer; display: flex; align-items: flex-start; gap: 10px; user-select: none; }
|
||||
.inc-top:hover { background: var(--hover); }
|
||||
|
||||
.sev { font-family: var(--mono); font-size: 11px; font-weight: 600; padding: 2px 8px; border-radius: 4px; flex-shrink: 0; text-transform: uppercase; margin-top: 2px; }
|
||||
.sev-1 { background: var(--red-bg); color: var(--red); border: 1px solid color-mix(in srgb, var(--red) 30%, transparent); }
|
||||
.sev-2 { background: var(--amber-bg); color: var(--amber); border: 1px solid color-mix(in srgb, var(--amber) 30%, transparent); }
|
||||
.sev-3 { background: var(--surface); color: var(--fg2); border: 1px solid var(--border); }
|
||||
|
||||
.inc-info { flex: 1; min-width: 0; }
|
||||
.inc-title { font-size: 14px; font-weight: 600; }
|
||||
.inc-meta { font-size: 12px; color: var(--fg3); margin-top: 2px; display: flex; gap: 12px; flex-wrap: wrap; }
|
||||
.inc-services { display: flex; gap: 4px; flex-wrap: wrap; margin-top: 6px; }
|
||||
.inc-svc { font-size: 11px; padding: 1px 8px; border-radius: 4px; background: var(--hover); border: 1px solid var(--border); color: var(--fg2); }
|
||||
|
||||
.inc-tl { border-top: 1px solid var(--border); padding: 12px 16px; display: none; }
|
||||
.inc.open .inc-tl { display: block; }
|
||||
.tl-entry { position: relative; padding-left: 20px; padding-bottom: 14px; border-left: 2px solid var(--border); margin-left: 4px; }
|
||||
.tl-entry:last-child { padding-bottom: 0; }
|
||||
.tl-entry::before { content: ''; position: absolute; left: -5px; top: 4px; width: 8px; height: 8px; border-radius: 50%; background: var(--fg3); border: 2px solid var(--surface); }
|
||||
.tl-time { font-family: var(--mono); font-size: 11px; color: var(--fg3); }
|
||||
.tl-status { font-size: 12px; font-weight: 600; color: var(--fg); display: inline; }
|
||||
.tl-body { font-size: 13px; color: var(--fg2); margin-top: 2px; white-space: pre-wrap; word-break: break-word; }
|
||||
|
||||
.inc-links { margin-top: 10px; font-size: 12px; display: flex; gap: 14px; }
|
||||
.inc-links a { color: var(--indigo); text-decoration: none; }
|
||||
.inc-links a:hover { text-decoration: underline; }
|
||||
|
||||
.inc-resolved { opacity: 0.7; }
|
||||
.inc-resolved:hover { opacity: 1; }
|
||||
|
||||
.sev-ur { background: color-mix(in srgb, var(--indigo) 15%, transparent); color: var(--indigo); border: 1px solid color-mix(in srgb, var(--indigo) 30%, transparent); }
|
||||
|
||||
.report-bar { display: flex; align-items: center; justify-content: space-between; gap: 12px; padding: 12px 16px; border-radius: 10px; margin-bottom: 24px; background: var(--surface); border: 1px solid var(--border); }
|
||||
.report-bar span { font-size: 13px; color: var(--fg2); }
|
||||
.report-btn { font-family: var(--sans); font-size: 12px; font-weight: 600; padding: 6px 16px; border-radius: 6px; background: var(--indigo); color: #fff; text-decoration: none; white-space: nowrap; transition: opacity 0.15s; }
.report-btn:hover { opacity: 0.85; }
.bar { display: flex; gap: 6px; margin-bottom: 20px; flex-wrap: wrap; align-items: center; }
.bar label { font-size: 11px; color: var(--fg3); text-transform: uppercase; letter-spacing: 0.06em; font-weight: 500; }
.fbtn { font-family: var(--sans); font-size: 12px; padding: 4px 12px; border-radius: 6px; border: 1px solid var(--border); background: transparent; color: var(--fg2); cursor: pointer; font-weight: 500; }
.fbtn:hover { border-color: var(--fg3); color: var(--fg); }
.fbtn.on { background: var(--fg); color: var(--bg); border-color: var(--fg); }
.bar select { font-family: var(--sans); font-size: 12px; padding: 4px 8px; border-radius: 6px; border: 1px solid var(--border); background: var(--bg); color: var(--fg); cursor: pointer; }
.g { background: var(--surface); border: 1px solid var(--border); border-radius: 10px; margin-bottom: 12px; overflow: hidden; }
.g.hide { display: none; }
.gh { padding: 14px 16px; cursor: pointer; display: flex; align-items: center; justify-content: space-between; user-select: none; }
.gh:hover { background: var(--hover); }
.gt { font-weight: 600; font-size: 13px; display: flex; align-items: center; gap: 8px; }
.gt .n { font-weight: 400; color: var(--fg3); font-size: 12px; }
.chev { font-size: 10px; color: var(--fg3); transition: transform 0.15s; display: inline-block; }
.g.shut .chev { transform: rotate(-90deg); }
.g.shut .gb { display: none; }
.gs { font-family: var(--mono); font-size: 12px; display: flex; gap: 8px; }
.gb { border-top: 1px solid var(--border); }
.colh { display: flex; align-items: center; padding: 6px 16px; gap: 8px; }
.colh-sp { width: 8px; flex-shrink: 0; }
.colh-n { flex: 1; font-size: 10px; color: var(--fg3); text-transform: uppercase; letter-spacing: 0.08em; font-weight: 500; }
.colh-v { display: flex; gap: 2px; }
.colh-l { width: 52px; text-align: right; font-size: 10px; color: var(--fg3); text-transform: uppercase; letter-spacing: 0.06em; font-weight: 500; }
.row { display: flex; align-items: center; padding: 8px 16px; gap: 8px; border-top: 1px solid var(--border); }
.row:first-of-type { border-top: none; }
.row:hover { background: var(--hover); }
.row.hide { display: none; }
.d { width: 8px; height: 8px; border-radius: 50%; flex-shrink: 0; }
.d-up { background: var(--green); }
.d-dn { background: var(--red); box-shadow: 0 0 0 3px rgba(239,68,68,0.15); }
.d-pn { background: var(--amber); }
.mn { flex: 1; font-size: 13px; font-weight: 500; overflow: hidden; text-overflow: ellipsis; white-space: nowrap; }
.mn a { color: inherit; text-decoration: none; border-bottom: 1px solid transparent; transition: border-color 0.15s; }
.mn a:hover { color: var(--indigo); border-bottom-color: var(--indigo); }
.uv { display: flex; gap: 2px; font-family: var(--mono); font-size: 12px; }
.uv span { width: 52px; text-align: right; color: var(--fg3); }
.uv .ok { color: var(--green); }
.uv .wn { color: var(--amber); }
.uv .bd { color: var(--red); }
footer { color: var(--fg3); font-size: 11px; margin-top: 32px; padding-top: 16px; border-top: 1px solid var(--border); text-align: center; }
.ld { text-align: center; padding: 60px 0; color: var(--fg3); }
.err { text-align: center; padding: 40px 0; color: var(--red); }
@media (max-width: 480px) {
.wrap { padding: 20px 14px 40px; }
.uv span, .colh-l { width: 42px; font-size: 11px; }
.row, .colh { padding-left: 12px; padding-right: 12px; }
.gh { padding: 12px; }
.inc-top { padding: 12px; }
}
</style>
</head>
<body>
<div class="wrap">
<header>
<h1>Service Status</h1>
<div class="ts" id="ts"></div>
</header>
<div class="stale" id="stale"></div>
<div class="hero" id="hero"></div>
<div class="report-bar">
<span>Something not working?</span>
<a href="https://github.com/ViktorBarzin/infra/issues/new?template=outage-report.yml" target="_blank" rel="noopener" class="report-btn">Report an Outage</a>
</div>
<div id="incidents"></div>
<div class="bar" id="bar" style="display:none">
<label>Show:</label>
<button class="fbtn on" data-f="all">All</button>
<button class="fbtn" data-f="up">Up</button>
<button class="fbtn" data-f="down">Down</button>
<span style="flex:1"></span>
<label>Sort:</label>
<select id="ss">
<option value="status">Status</option>
<option value="name">Name</option>
<option value="u-asc">Uptime asc</option>
<option value="u-desc">Uptime desc</option>
</select>
</div>
<div id="gs"><div class="ld">Loading…</div></div>
<footer>Updated every 5 minutes · Powered by Uptime Kuma · <a href="https://github.com/ViktorBarzin/infra/issues" style="color:var(--fg3)">Report issues</a></footer>
</div>
<script>
(function(){
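// U: data URL · S: staleness threshold in ms (10 min) · D: last payload · F: active filter · O: sort order
var U='status.json',S=6e5,D=null,F='all',O='status';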
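// HTML-escape text. Quotes are escaped as well because esc() output is also placed inside href="..." attributes.
function esc(s){var d=document.createElement('div');d.textContent=s||'';return d.innerHTML.replace(/"/g,'&quot;').replace(/'/g,'&#39;')}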
function ago(d){var s=Math.floor((Date.now()-d)/1e3);if(s<0)s=0;return s<60?s+'s ago':s<3600?Math.floor(s/60)+'m ago':s<86400?Math.floor(s/3600)+'h ago':Math.floor(s/86400)+'d ago'}
function dur(start,end){var m=Math.floor((end-start)/6e4);if(m<1)return '<1m';return m<60?m+'m':Math.floor(m/60)+'h '+m%60+'m'}
function uc(p){return p==null?'':p>=99?'ok':p>=95?'wn':'bd'}
function pf(p){return p==null?'\u2014':p.toFixed(1)+'%'}
function srt(a){return a.slice().sort(function(x,y){
if(O==='name')return x.name.localeCompare(y.name);
if(O==='u-asc'){var xa=x.uptime_24h==null?101:x.uptime_24h,ya=y.uptime_24h==null?101:y.uptime_24h;return xa-ya}
if(O==='u-desc'){var xd=x.uptime_24h==null?-1:x.uptime_24h,yd=y.uptime_24h==null?-1:y.uptime_24h;return yd-xd}
var o={down:0,pending:1,up:2},ao=o[x.status]!=null?o[x.status]:1,bo=o[y.status]!=null?o[y.status]:1;
return ao!==bo?ao-bo:x.name.localeCompare(y.name);
})}
function fm(m){return F==='all'||(F==='up'?m.status==='up':m.status!=='up')}
function buildIncident(inc,resolved){
var isReport=inc.type==='user-report';
var sevNum=isReport?0:inc.severity==='sev1'?1:inc.severity==='sev2'?2:3;
var created=new Date(inc.created_at);
var end=resolved&&inc.closed_at?new Date(inc.closed_at):new Date();
var el=document.createElement('div');
el.className='inc'+(resolved?' inc-resolved':'');
// Top bar
var top=document.createElement('div');
top.className='inc-top';
var badgeHtml=isReport
?'<div class="sev sev-ur">REPORT</div>'
:'<div class="sev sev-'+sevNum+'">SEV'+sevNum+'</div>';
var html=badgeHtml;
html+='<div class="inc-info"><div class="inc-title">'+esc(inc.title)+'</div>';
html+='<div class="inc-meta"><span>'+ago(created)+'</span>';
if(!isReport)html+='<span>'+dur(created,end)+'</span>';
if(resolved)html+='<span style="color:var(--green)">Resolved</span>';
html+='</div>';
if(inc.affected_services&&inc.affected_services.length){
html+='<div class="inc-services">';
for(var i=0;i<inc.affected_services.length;i++)html+='<span class="inc-svc">'+esc(inc.affected_services[i])+'</span>';
html+='</div>';
}
html+='</div><span class="chev">▸</span>';
top.innerHTML=html;
top.onclick=function(){el.classList.toggle('open')};
el.appendChild(top);
// Timeline
var tl=document.createElement('div');
tl.className='inc-tl';
if(inc.timeline&&inc.timeline.length){
for(var i=inc.timeline.length-1;i>=0;i--){
var te=inc.timeline[i];
var entry=document.createElement('div');
entry.className='tl-entry';
entry.innerHTML='<div class="tl-time">'+new Date(te.timestamp).toLocaleString()+'</div>'
+'<div class="tl-status">'+esc(te.status)+'</div>'
+'<div class="tl-body">'+esc(te.body)+'</div>';
tl.appendChild(entry);
}
}
// Links
var links=document.createElement('div');
links.className='inc-links';
if(inc.postmortem)links.innerHTML+='<a href="'+esc(inc.postmortem)+'" target="_blank" rel="noopener">Postmortem</a>';
links.innerHTML+='<a href="'+esc(inc.url)+'" target="_blank" rel="noopener">View on GitHub →</a>';
tl.appendChild(links);
el.appendChild(tl);
return el;
}
function render(data){
D=data;
var t=new Date(data.last_updated),age=Date.now()-t.getTime();
document.getElementById('ts').textContent=ago(t);
var st=document.getElementById('stale');
if(age>S){st.textContent='Data is '+Math.floor(age/6e4)+' minutes old. Monitoring may be unreachable.';st.style.display='block'}else st.style.display='none';
var gs={};
for(var gn in data.groups){var a=data.groups[gn].filter(function(m){return m.status!=='paused'});if(a.length)gs[gn]=a}
var tu=0,td=0;
for(var g in gs)for(var i=0;i<gs[g].length;i++)gs[g][i].status==='up'?tu++:td++;
// Incidents
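// Default each list separately so a partial "incidents" object cannot throw on .length below
var raw=data.incidents||{},inc={active:raw.active||[],resolved:raw.resolved||[],user_reports:raw.user_reports||[]};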
var incEl=document.getElementById('incidents');
incEl.innerHTML='';
// Hero — incidents take priority
var h=document.getElementById('hero');
if(inc.active.length>0){
var maxSev=3;
for(var si=0;si<inc.active.length;si++){
var s=inc.active[si].severity==='sev1'?1:inc.active[si].severity==='sev2'?2:3;
if(s<maxSev)maxSev=s;
}
if(maxSev===1){h.className='hero hero-down';h.innerHTML='<div class="hero-dot"></div>Active Incident \u2014 SEV1'}
else{h.className='hero hero-warn';h.innerHTML='<div class="hero-dot"></div>'+inc.active.length+' Active Incident'+(inc.active.length>1?'s':'')}
}else if(!td){h.className='hero hero-ok';h.innerHTML='<div class="hero-dot"></div>All Systems Operational'}
else if(td<=3){h.className='hero hero-warn';h.innerHTML='<div class="hero-dot"></div>'+td+' service'+(td>1?'s':'')+' experiencing issues'}
else{h.className='hero hero-down';h.innerHTML='<div class="hero-dot"></div>'+td+' services down'}
// Render active incidents
if(inc.active.length>0){
var ah=document.createElement('div');
ah.className='inc-header';
ah.innerHTML='Active Incidents <span class="cnt">'+inc.active.length+'</span>';
incEl.appendChild(ah);
for(var ai=0;ai<inc.active.length;ai++)incEl.appendChild(buildIncident(inc.active[ai],false));
}
// Render user reports
var reports=inc.user_reports||[];
if(reports.length>0){
var urh=document.createElement('div');
urh.className='inc-header';
urh.innerHTML='User Reports <span class="cnt">'+reports.length+'</span>';
incEl.appendChild(urh);
for(var ui=0;ui<reports.length;ui++)incEl.appendChild(buildIncident(reports[ui],false));
}
// Render resolved incidents
if(inc.resolved.length>0){
var rh=document.createElement('div');
rh.className='inc-header resolved-header';
rh.innerHTML='Recently Resolved <span class="cnt">last 7 days</span>';
incEl.appendChild(rh);
for(var ri=0;ri<inc.resolved.length;ri++)incEl.appendChild(buildIncident(inc.resolved[ri],true));
}
// Monitor groups
document.getElementById('bar').style.display='flex';
var c=document.getElementById('gs');c.innerHTML='';
var ks=Object.keys(gs).sort(function(a,b){return gs[b].length-gs[a].length});
for(var ki=0;ki<ks.length;ki++){
var gn=ks[ki],ms=gs[gn],so=srt(ms),vc=so.filter(fm).length;
var ge=document.createElement('div');ge.className='g'+(vc?'':' hide');
var up=ms.filter(function(m){return m.status==='up'}).length,dn=ms.length-up;
var hd=document.createElement('div');hd.className='gh';
hd.innerHTML='<div class="gt"><span class="chev">▸</span>'+esc(gn)+' <span class="n">'+ms.length+'</span></div><div class="gs">'+(dn?'<span style="color:var(--red)">'+dn+' down</span>':'')+'<span style="color:var(--green)">'+up+' up</span></div>';
hd.onclick=function(){this.parentElement.classList.toggle('shut')};
var bd=document.createElement('div');bd.className='gb';
var ch=document.createElement('div');ch.className='colh';
ch.innerHTML='<div class="colh-sp"></div><div class="colh-n">Service</div><div class="colh-v"><div class="colh-l">24h</div><div class="colh-l">7d</div><div class="colh-l">30d</div></div>';
bd.appendChild(ch);
for(var mi=0;mi<so.length;mi++){
var m=so[mi],dc=m.status==='up'?'d-up':m.status==='pending'?'d-pn':'d-dn';
var r=document.createElement('div');r.className='row'+(fm(m)?'':' hide');
var nameHtml=esc(m.name);
if(m.url){nameHtml='<a href="'+esc(m.url)+'" target="_blank" rel="noopener">'+esc(m.name)+'</a>'}
r.innerHTML='<div class="d '+dc+'"></div><div class="mn">'+nameHtml+'</div><div class="uv"><span class="'+uc(m.uptime_24h)+'">'+pf(m.uptime_24h)+'</span><span class="'+uc(m.uptime_7d)+'">'+pf(m.uptime_7d)+'</span><span class="'+uc(m.uptime_30d)+'">'+pf(m.uptime_30d)+'</span></div>';
bd.appendChild(r);
}
ge.appendChild(hd);ge.appendChild(bd);c.appendChild(ge);
}
}
document.addEventListener('click',function(e){
if(!e.target.classList.contains('fbtn'))return;
var bs=document.querySelectorAll('.fbtn');for(var i=0;i<bs.length;i++)bs[i].classList.remove('on');
e.target.classList.add('on');F=e.target.getAttribute('data-f');if(D)render(D);
});
document.getElementById('ss').onchange=function(){O=this.value;if(D)render(D)};
function load(){
fetch(U+'?t='+Date.now()).then(function(r){if(!r.ok)throw 0;return r.json()}).then(render)
.catch(function(){document.getElementById('gs').innerHTML='<div class="err">Could not load status data.</div>';
var h=document.getElementById('hero');h.className='hero hero-down';h.innerHTML='<div class="hero-dot"></div>Status Unavailable'});
}
load();setInterval(load,6e4);
})();
</script>
</body>
</html>
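Note on the data contract: the page script reads everything from `status.json`, but the payload shape is not shown in this file. The following is a minimal sketch of what the script appears to expect, inferred only from the property accesses in `render` and `buildIncident`. All sample values, the `example.com` URLs, and the `summarize` helper are hypothetical and are not part of the commit.

```javascript
// Hypothetical sample payload. Field names come from the page script;
// every value here is invented for illustration.
const sample = {
  last_updated: "2025-01-01T12:00:00Z",
  groups: {
    Core: [
      { name: "dns", status: "up", url: "https://example.com/dns",
        uptime_24h: 100, uptime_7d: 99.9, uptime_30d: 99.5 },
      { name: "mail", status: "down",
        uptime_24h: 91.2, uptime_7d: null, uptime_30d: null },
      { name: "legacy", status: "paused",
        uptime_24h: null, uptime_7d: null, uptime_30d: null }
    ]
  },
  incidents: {
    active: [{
      severity: "sev2",
      title: "Mail delivery delayed",
      created_at: "2025-01-01T11:00:00Z",
      closed_at: null,
      affected_services: ["mail"],
      timeline: [
        { timestamp: "2025-01-01T11:05:00Z", status: "investigating", body: "Queue is backing up" }
      ],
      postmortem: null,
      url: "https://example.com/issues/1"
    }],
    resolved: [],
    user_reports: [{
      type: "user-report",
      title: "Cannot log in",
      created_at: "2025-01-01T11:30:00Z",
      url: "https://example.com/issues/2"
    }]
  }
};

// Mirrors the counting done in render(): paused monitors are skipped and
// every status other than "up" is counted as down.
function summarize(data) {
  let up = 0;
  let down = 0;
  for (const monitors of Object.values(data.groups || {})) {
    for (const m of monitors) {
      if (m.status === "paused") continue;
      if (m.status === "up") up++; else down++;
    }
  }
  const incidents = data.incidents || {};
  const active = incidents.active || [];
  // Lower number means more severe; unknown values fall back to 3,
  // matching the ternary used in the page script.
  let worstSeverity = null;
  for (const inc of active) {
    const s = inc.severity === "sev1" ? 1 : inc.severity === "sev2" ? 2 : 3;
    if (worstSeverity === null || s < worstSeverity) worstSeverity = s;
  }
  return { up, down, activeIncidents: active.length, worstSeverity };
}

console.log(summarize(sample));
```

A helper like this could be used by the CronJob side to sanity-check a generated payload before publishing it, since the page itself only falls back to "Status Unavailable" when the fetch or JSON parse fails.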
8
stacks/status-page/terragrunt.hcl
Normal file
@@ -0,0 +1,8 @@
include "root" {
path = find_in_parent_folders()
}
dependency "infra" {
config_path = "../infra"
skip_outputs = true
}