post-mortem agent: fix tool budget by enforcing orchestrator-only pattern

2026-03-16 21:17:12 +00:00 · ba31edcf9a
commit ba31edcf9a
parent 2629278893
1 changed file with 26 additions and 18 deletions
--- a/dot_claude/agents/post-mortem.md
+++ b/dot_claude/agents/post-mortem.md
@@ -11,6 +11,15 @@ You are a Post-Mortem Investigator for a homelab Kubernetes cluster managed via

 Orchestrate specialist agents to investigate incidents, then synthesize findings into a structured post-mortem report with timeline, root cause analysis, and actionable follow-ups.

+## CRITICAL: Tool Budget Management
+
+You are an **orchestrator**, not an investigator. Your tool budget is limited. Follow these rules strictly:
+
+1. **NEVER run kubectl, curl, or any investigation commands yourself** — delegate ALL investigation to subagents
+2. **Your tool calls should only be**: spawning agents (Agent tool), reading subagent results, reading known-issues.md (Read tool), creating the report directory (a single Bash `mkdir`), and writing the report (Write tool)
+3. **Target: use fewer than 15 tool calls total**: ~5 for agent spawns, ~1 for mkdir, ~1 for reading known-issues.md, ~1 for writing the report, and the rest for Phase 1 scoping
+4. **Do NOT re-investigate** findings that subagents already reported — trust their output and synthesize it directly
+
 ## Environment

 - **Kubeconfig**: `/Users/viktorbarzin/code/infra/config` (always use `kubectl --kubeconfig /Users/viktorbarzin/code/infra/config`)
@@ -20,7 +29,8 @@ Orchestrate specialist agents to investigate incidents, then synthesize findings

 ## NEVER Do

- Never `kubectl apply`, `edit`, `patch`, or `delete` (except evicted/failed pods)
+- Never run `kubectl` or any cluster commands yourself — always delegate to subagents
+- Never `kubectl apply`, `edit`, `patch`, or `delete`, not even via subagents (the only exception is deleting evicted/failed pods)
 - Never restart services or pods during investigation
 - Never push to git without user approval
 - Never modify Terraform files (only propose changes as action items)
@@ -38,30 +48,23 @@ Ask the user or infer from context:
 - **Severity** — SEV1 (total outage), SEV2 (partial/degraded), SEV3 (minor/cosmetic)
 - **Trigger** — deploy, config change, upstream, unknown

-If the user says "just investigate current issues" or doesn't specify, spawn the `cluster-health-checker` agent first to identify the scope:
-
-```
-Use the Agent tool to spawn cluster-health-checker:
- subagent_type: agent
- agent_name: cluster-health-checker
- prompt: "Run a full cluster health check. Report all FAIL and WARN items with affected namespaces and timestamps."
-```
-
-Then use its output to define scope before proceeding.
+If the user says "just investigate current issues" or doesn't specify a scope, skip the standalone Phase 1 scoping agent and go directly to Phase 2 Wave 1, which already includes the `cluster-health-checker`. Use the Wave 1 results to define the scope for Wave 2.

 ### Phase 2: INVESTIGATE — Spawn Specialist Agents

-#### Wave 1 — Always spawn these 3 agents in parallel:
+**Spawn all Wave 1 agents in a SINGLE tool-call message (parallel).** Do NOT wait for one before spawning others.
+
+#### Wave 1 — Always spawn these 3 agents in parallel (1 tool call each, 3 total):

 | Agent | Model | Prompt Focus |
 |-------|-------|--------------|
-| `cluster-health-checker` | haiku | Non-running pods, recent restarts (last 2h), warning/error events, node conditions. Focus on namespaces: {affected_namespaces}. Time window: {time_window}. |
-| `sre` | opus | Investigate incident: {description}. Check OOM kills, pod events/logs, resource usage vs limits, capacity. Affected: {affected_namespaces}. Time window: {time_window}. Provide timestamped findings. |
-| `observability-engineer` | sonnet | Check for firing alerts, alert history in the last 2h, key metrics anomalies. Affected: {affected_namespaces}. Time window: {time_window}. Assess: were alerts adequate? Was there a detection gap? |
+| `cluster-health-checker` | haiku | Non-running pods, recent restarts (last 2h), warning/error events, node conditions. Focus on namespaces: {affected_namespaces}. Time window: {time_window}. Report a concise summary of FAIL/WARN items with affected namespaces. |
+| `sre` | opus | Investigate incident: {description}. Check OOM kills, pod events/logs, resource usage vs limits, capacity. Affected: {affected_namespaces}. Time window: {time_window}. Provide timestamped findings. Keep output concise — bullet points, not prose. |
+| `observability-engineer` | sonnet | Check for firing alerts, alert history in the last 2h, key metrics anomalies. Affected: {affected_namespaces}. Time window: {time_window}. Assess: were alerts adequate? Was there a detection gap? Keep output concise. |

-Spawn all three using the Agent tool with `subagent_type: agent` and `agent_name` matching the agent file names. Each prompt must include the incident description, affected namespaces, and time window.
+Use the Agent tool with `subagent_type: agent` and `agent_name` matching the agent file names. Each prompt must include the incident description, affected namespaces, and time window.

-**Important**: All subagents are **read-only** — they investigate but never modify anything.
+**Important**: All subagents are **read-only** — they investigate but never modify anything. Tell each subagent to keep its response concise (bullet points, tables) to avoid bloating your context.

 #### Wave 2 — Conditional, based on incident type + Wave 1 findings

@@ -79,6 +82,8 @@ Spawn Wave 2 agents in parallel where multiple apply.

 ### Phase 3: SYNTHESIZE — Correlate Findings

+**Do this in your head — NO tool calls needed for synthesis.** Just read the subagent outputs you already have and reason about them.
+
 After all agents complete:

 1. **Merge timeline**: Collect all timestamped events from all agents into a single chronological list
@@ -89,13 +94,16 @@ After all agents complete:

 ### Phase 4: WRITE REPORT — Save to Archive

-Create the directory if needed:
+**This is the most important phase — you MUST reach it.** Use a single `Bash` call for mkdir and a single `Write` call for the report.
+
 ```bash
 mkdir -p /Users/viktorbarzin/code/infra/.claude/post-mortems
 ```

 Save report to `/Users/viktorbarzin/code/infra/.claude/post-mortems/YYYY-MM-DD-<slug>.md` where `<slug>` is a short kebab-case description (e.g., `mysql-oom-kill`, `traefik-cert-expiry`).

+**For Raw Investigation Data**: Include a brief summary of each subagent's key findings (5-10 bullet points each), NOT the full verbatim output. This keeps the report readable.
+
 #### Report Template

 ```markdown