fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip]

6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parent 6d224861c4
commit fd0f4a0365
1166 changed files with 358546 additions and 0 deletions
.claude/skills/post-mortem/skill.md (normal file, 78 lines added)
@@ -0,0 +1,78 @@
# Post-Mortem Writer

Generate a structured post-mortem document after an incident mitigation session.

## When to use

- After `/post-mortem` command
- Auto-suggested when cluster health transitions from UNHEALTHY → HEALTHY

## Instructions

1. **Gather context**:
   - Run `.claude/scripts/sev-context.sh` to capture current cluster state
   - Review the conversation history for: what broke, timeline, root cause, what was fixed
   - Check existing post-mortems at `docs/post-mortems/` for format reference

2. **Generate the post-mortem**:
   - Use the template at `.claude/skills/post-mortem/template.md`
   - Fill in all sections from the investigation context
   - **Critical**: In the Prevention Plan tables, set the `Type` column correctly:
     - `Alert` — add/modify Prometheus alerting rules (auto-implementable)
     - `Config` — change Terraform config, NFS options, etc. (auto-implementable)
     - `Monitor` — add Uptime Kuma monitors (auto-implementable)
     - `Architecture` — storage migration, stack redesign (human-only)
     - `Investigation` — needs further research (human-only)
     - `Runbook` — document a procedure (human-only)
     - `Migration` — data or service migration (human-only)
   - Items already fixed during the session should have Status = `Done`
   - Items not yet done should have Status = `TODO`

3. **File naming**: `docs/post-mortems/<YYYY-MM-DD>-<slug>.md`
   - Slug: lowercase, hyphenated, max 5 words describing the incident

4. **Update index**: Add an entry to `docs/post-mortems/index.html`
   - Add a new card in the incidents grid with date, severity tag, title, description

5. **Link to GitHub Issue** (if an issue exists for this incident):
   - Fill in the `Issue` field in the template metadata table with `[#N](https://github.com/ViktorBarzin/infra/issues/N)`
   - Add a comment to the GitHub Issue linking the postmortem:
     ```bash
     # Read the GitHub personal access token from Vault
     GITHUB_TOKEN=$(vault kv get -field=github_pat secret/viktor)
     # Replace <N>, <YYYY-MM-DD> and <slug> before running
     curl -s -X POST \
       -H "Authorization: token $GITHUB_TOKEN" \
       -H "Accept: application/vnd.github.v3+json" \
       "https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/comments" \
       -d '{"body": "**Postmortem:** [View postmortem](https://viktorbarzin.github.io/infra/post-mortems/<YYYY-MM-DD>-<slug>)"}'
     ```
   - Add the `postmortem-done` label and remove `postmortem-required`:
     ```bash
     # Reuses GITHUB_TOKEN from the previous command
     curl -s -X POST \
       -H "Authorization: token $GITHUB_TOKEN" \
       "https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/labels" \
       -d '{"labels": ["postmortem-done"]}'
     curl -s -X DELETE \
       -H "Authorization: token $GITHUB_TOKEN" \
       "https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/labels/postmortem-required"
     ```
   - If no issue exists, create one with labels `incident`, `sev<N>`, `postmortem-done`

6. **Commit and push**:
   ```bash
   git add docs/post-mortems/<file>.md docs/post-mortems/index.html
   git commit -m "docs: post-mortem for <date> <title> [ci skip]"
   git push origin master
   ```
   - Use `[ci skip]` to avoid triggering the app-stacks pipeline
   - NOTE: The postmortem-todos Woodpecker pipeline WILL trigger (it has its own path filter)

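
The file-naming rule in step 3 can be sketched as a small shell function. This is only an illustration, not part of the skill: `postmortem_path` is a hypothetical name, and the sketch assumes that "max 5 words" means keeping the first five words of the incident title.

```bash
# Hypothetical helper (not in the repo): build the post-mortem path
# docs/post-mortems/<YYYY-MM-DD>-<slug>.md from a date and a free-form title.
# The body runs in a subshell, so its variables stay local.
postmortem_path() (
  date_part=$1
  title=$2
  slug=$(printf '%s' "$title" \
    | tr '[:upper:]' '[:lower:]' \
    | tr -cs 'a-z0-9' ' ' \
    | awk '{
        # Assumption: keep at most the first five words, joined by hyphens.
        n = (NF < 5 ? NF : 5)
        for (i = 1; i <= n; i++) printf "%s%s", $i, (i < n ? "-" : "")
      }')
  printf 'docs/post-mortems/%s-%s.md\n' "$date_part" "$slug"
)
```

For example, `postmortem_path 2024-01-15 'NFS Mount Stall on Node 3!'` prints `docs/post-mortems/2024-01-15-nfs-mount-stall-on-node.md`. Picking the five most descriptive words by hand will usually give a better slug than plain truncation.
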
## Type Reference for Prevention Plan

| Type | Auto-implementable? | Examples |
|------|---------------------|----------|
| Alert | Yes | Add PrometheusRule, modify alert thresholds |
| Config | Yes | Change Terraform variables, mount options, CronJob schedules |
| Monitor | Yes | Add Uptime Kuma HTTP/TCP monitor |
| Architecture | No | Migrate storage class, redesign HA topology |
| Investigation | No | Research kernel bug, check Proxmox forum |
| Runbook | No | Document recovery procedure |
| Migration | No | Move data between storage backends |

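
The table above amounts to a two-way classification of `Type` values. A minimal sketch of that mapping, assuming the seven types listed are the complete set; `implementer_for_type` is a hypothetical name and not something defined in this repository:

```bash
# Hypothetical helper: map a Prevention Plan Type to who can implement it.
# Prints "auto" or "human"; prints "unknown" and returns 1 for anything else.
implementer_for_type() {
  case "$1" in
    Alert|Config|Monitor)
      echo "auto" ;;
    Architecture|Investigation|Runbook|Migration)
      echo "human" ;;
    *)
      echo "unknown"
      return 1 ;;
  esac
}
```

Returning a non-zero status for unrecognised values makes a misspelled `Type` (for example `Alerts`) fail loudly instead of being silently treated as human-only.
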
.claude/skills/post-mortem/template.md (normal file, 86 lines added)
@@ -0,0 +1,86 @@
# Post-Mortem: <TITLE>

| Field | Value |
|-------|-------|
| **Date** | <DATE> |
| **Duration** | <DURATION> |
| **Severity** | <SEV1/SEV2/SEV3> |
| **Affected Services** | <COUNT> pods across <COUNT> namespaces |
| **Issue** | [#N](https://github.com/ViktorBarzin/infra/issues/N) |
| **Status** | Draft |

## Summary

<1-2 sentence summary of the incident.>

## Impact

- **User-facing**: <What users experienced>
- **Blast radius**: <How many services/pods/namespaces affected>
- **Duration**: <How long the outage lasted>
- **Data loss**: <None/details>
- **Monitoring gap**: <Any blind spots in alerting>

## Timeline (UTC)

| Time | Event |
|------|-------|
| **HH:MM** | <First sign of trouble> |
| **HH:MM** | <Detection / user report> |
| **HH:MM** | <Investigation begins> |
| **HH:MM** | <Root cause identified> |
| **HH:MM** | <Fix applied> |
| **HH:MM** | <Service restored> |

## Root Cause

<Narrative description of what went wrong and why.>

## Contributing Factors

1. <Factor that made the incident worse or harder to detect>
2. <Factor...>

## Detection Gaps

| Gap | Impact | Fix |
|-----|--------|-----|
| <What wasn't monitored> | <How it delayed detection> | <What to add> |

## Prevention Plan

### P0 — Prevent this exact failure

| Priority | Action | Type | Details | Status |
|----------|--------|------|---------|--------|
| P0 | <action> | Config | <details> | TODO |

### P1 — Reduce blast radius

| Priority | Action | Type | Details | Status |
|----------|--------|------|---------|--------|
| P1 | <action> | Alert | <details> | TODO |

### P2 — Detect faster

| Priority | Action | Type | Details | Status |
|----------|--------|------|---------|--------|
| P2 | <action> | Monitor | <details> | TODO |

### P3 — Improve resilience

| Priority | Action | Type | Details | Status |
|----------|--------|------|---------|--------|
| P3 | <action> | Architecture | <details> | TODO |

## Lessons Learned

1. <Key takeaway>
2. <Key takeaway>

## Follow-up Implementation

_This section is auto-populated by the postmortem-todo-resolver agent._

| Date | Action | Priority | Type | Commit | Implemented By |
|------|--------|----------|------|--------|----------------|

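
As a rough illustration of how the `Type` and `Status` columns of the Prevention Plan tables can be read by a script, the sketch below lists every row still marked `TODO`. It is not the postmortem-todo-resolver agent's actual logic; `list_open_actions` is a hypothetical name, and the sketch assumes that rows follow the five-column layout used in this template and that no cell contains a literal `|`.

```bash
# Hypothetical helper: print "<priority><TAB><type><TAB><action>" for each
# Prevention Plan row in the given post-mortem file whose Status is TODO.
list_open_actions() {
  awk -F'|' '
    # Rows look like: | Priority | Action | Type | Details | Status |
    # Splitting on "|" gives an empty first field, so Priority is $2,
    # Action is $3, Type is $4 and Status is $6.
    NF >= 7 && $2 ~ /^ *P[0-3] *$/ && $6 ~ /^ *TODO *$/ {
      for (i = 2; i <= 4; i++) gsub(/^ +| +$/, "", $i)  # trim cell padding
      printf "%s\t%s\t%s\n", $2, $4, $3
    }
  ' "$1"
}
```

Usage: `list_open_actions docs/post-mortems/<file>.md`. Header and separator rows are skipped because their first cell is not `P0` to `P3`.
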