beadboard/docs/plans/2026-03-07-phase-6-runtime-hardening.md
zenchantlive d335e5bf71 fix: orchestrator button + Pi SDK session error
- Move leftSidebarMode from URL state to local useState in unified-shell,
    avoiding force-dynamic router round-trip that made the button appear broken                                           - Replace fileURLToPath(new URL(..., import.meta.url)) with process.cwd()
    in bb-pi-bootstrap.ts — import.meta.url is a webpack:// URL in Next.js,
    causing cross-realm TypeError when passed to Node.js fileURLToPath()
2026-03-24 19:02:04 -05:00

75 lines
1.6 KiB
Markdown

# Phase 6 - Runtime Hardening
**Status:** Planning
**Created:** 2026-03-07
**Goal:** Improve robustness of embedded Pi runtime, reconnect behavior, and error recovery
---
## Current Issues
1. Session disconnect requires manual restart
2. Stuck/hung agents have no clear diagnostics
3. Drift between TUI Pi loader and embedded Pi loader
4. No automatic recovery from failures
---
## Implementation Plan
### Step 1: Session Health Monitoring
- Add heartbeat check for orchestrator session
- Detect when session is unresponsive
- Show clear status in UI
**Files:**
- `src/lib/pi-daemon-adapter.ts` - health check
- `src/lib/embedded-daemon.ts` - monitoring
### Step 2: Automatic Reconnect
- On session disconnect, attempt reconnect
- Preserve conversation history
- Show reconnect status to user
**Files:**
- `src/lib/pi-daemon-adapter.ts`
- `src/components/shared/left-panel.tsx` - reconnect UI
### Step 3: Stuck Agent Diagnostics
- Detect agents stuck in "spawning" for too long
- Provide diagnostic information
- Allow user to cancel/retry
**Files:**
- `src/lib/worker-session-manager.ts`
- `src/components/agents/agent-status-panel.tsx`
### Step 4: Error Recovery UX
- Clear error messages when things fail
- Retry buttons for failed operations
- Logs for debugging
**Files:**
- Error handling across runtime components
- UI for error display
---
## Blocked Items
None identified.
---
## Success Criteria
- [ ] Session health monitored and shown in UI
- [ ] Automatic reconnect on disconnect
- [ ] Stuck agents detected and reported
- [ ] Clear error messages with recovery options
---
## Estimated Effort
3-4 hours