# Test Results

**TDD methodology applied to skill creation.**

## Methodology

Following TDD for skills: RED (baseline failures) → GREEN (write skill) → REFACTOR (close loopholes)

---
## RED Phase: Baseline Testing

### Scenario 1: Sunk Cost + Time + Complexity Worship

**Setup:** The agent spent 3 hours building a "flexible" abstraction. The code passes tests. It is 6pm and the review is tomorrow. The abstraction isn't needed.

**Options:**
- A) Delete abstraction, simplify
- B) Keep it - "might need later"
- C) Keep it - "3 hours invested"

**Baseline (WITHOUT skill):** Agents chose B or C in 80% of cases.

**Verbatim Rationalizations:**
- "It doesn't hurt anything to keep it"
- "The abstraction is well-designed"
- "Deleting working code feels wasteful"

### Scenario 2: Beads Bypass + Speed

**Setup:** A quick status update is needed. `bd update` requires thinking through a note; editing the JSONL directly is 10x faster.

**Options:**
- A) Use `bd update`
- B) Edit `.beads/issues.jsonl` directly
- C) Skip bead update entirely

**Baseline (WITHOUT skill):** Agents chose B or C in 60% of cases.

**Verbatim Rationalizations:**
- "Just this once won't matter"
- "bd is slower and I know what I'm doing"
- "This is a trivial change"

### Scenario 3: Evidence Skip

**Setup:** The agent fixed a bug, tested it manually, and is ready to close the issue. Tests take 30s. The agent considers the fix "already verified."

**Options:**
- A) Run all gates before closing
- B) Close with "tested manually"
- C) Run just `npm run test`

**Baseline (WITHOUT skill):** Agents chose B or C in 70% of cases.

**Verbatim Rationalizations:**
- "I already tested it manually"
- "Typecheck never catches anything real"
- "The tests take too long"

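
Option A in Scenario 3 ("run all gates before closing") can be sketched as a small runner. This is a minimal illustration, not the skill's actual tooling: only `npm run test` appears in the scenario, so the `typecheck` and `lint` script names, and the helper names `failed_gates` and `may_close`, are assumptions.

```python
import subprocess
from typing import Callable, Sequence

# Assumed gate commands: only `npm run test` appears in the scenario;
# the typecheck and lint script names are hypothetical.
GATES = [
    ["npm", "run", "typecheck"],
    ["npm", "run", "lint"],
    ["npm", "run", "test"],
]


def run_command(command: Sequence[str]) -> int:
    """Run one gate command and return its exit code."""
    return subprocess.run(list(command)).returncode


def failed_gates(
    gates: Sequence[Sequence[str]] = GATES,
    runner: Callable[[Sequence[str]], int] = run_command,
) -> list:
    """Run every gate (no early exit) and return the commands that failed."""
    failures = []
    for command in gates:
        if runner(command) != 0:
            failures.append(" ".join(command))
    return failures


def may_close(failures: Sequence[str]) -> bool:
    """An issue may be closed only when no gate failed."""
    return len(failures) == 0
```

Passing `runner` as a parameter keeps the sketch testable without invoking npm. Running every gate, rather than stopping at the first failure, reports all problems in one pass.
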
### Scenario 4: Dependency Direction + "Just Display"

**Setup:** While implementing dependency visualization, the agent reversed the arrow direction because it "looks better."

**Options:**
- A) Read bd dependency model, verify semantics
- B) Implement visually, fix later
- C) Assume direction is arbitrary

**Baseline (WITHOUT skill):** Agents chose B or C in 50% of cases.

**Verbatim Rationalizations:**
- "It's just visualization"
- "Users won't know the difference"
- "I can add a toggle later"

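
One way to keep the display tied to the data model in Scenario 4 is to derive every arrow from a single function. The sketch below assumes a hypothetical record with `issue_id` and `depends_on_id` fields and an assumed arrow convention; the real bd dependency model should be read first (option A) and the names adjusted to match it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    # Hypothetical field names; check them against the real bd dependency model.
    issue_id: str        # the issue that has the dependency
    depends_on_id: str   # the issue it depends on


def edge_for(dep: Dependency) -> tuple:
    """Return (source, target) for the arrow, taken straight from the data model.

    Assumed convention for this sketch: the arrow points from the dependent
    issue to the issue it depends on.
    """
    return (dep.issue_id, dep.depends_on_id)


def render_edges(deps: list) -> list:
    """Render arrows as text; every view goes through edge_for()."""
    return [f"{src} -> {dst}" for src, dst in map(edge_for, deps)]
```

Because the direction is decided in one place, a renderer cannot reverse an arrow for visual reasons without changing `edge_for`, which makes the change visible in review.
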
### Scenario 5: Duplicate Fix + "Separate Pages"

**Setup:** The Kanban detail has a bug, and the Graph detail has the same bug. The shared logic could be extracted.

**Options:**
- A) Extract shared component
- B) Fix both independently
- C) Fix only Kanban, note Graph

**Baseline (WITHOUT skill):** Agents chose B in 65% of cases.

**Verbatim Rationalizations:**
- "They're different pages"
- "Extracting shared logic is overengineering"
- "I'll refactor if it happens a third time"

---
## GREEN Phase: Skill Creation

Created skill addressing each rationalization:

| Rationalization | Skill Counter |
|----------------|---------------|
| "Just this once" | Red flags list + Iron Laws |
| "Might need later" | YAGNI decision framework |
| "Already tested" | Verification gates required |
| "Just display" | Data model truth principle |
| "Different pages" | Shared logic principle |

---
## REFACTOR Phase: Close Loopholes

### Iteration 1 Findings
- Agents tried a "trivial change" exception
- Added it to the rationalization table
- Added it to the red flags list

### Iteration 2 Findings
- Agents tried "my changes don't affect types"
- Added it to the rationalization table
- Added an explicit "spirit vs letter" rejection

### Iteration 3 Findings
- Agents tried a "cleanup later" deferral
- Added it to the rationalization table
- Added an immediate-action requirement

### Final State
All tested rationalizations have explicit counters in:
- IRON_LAWS.md
- RATIONALIZATION_TABLE.md
- VERIFICATION_GATES.md

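
The claim that every tested rationalization has a counter can be checked mechanically. The sketch below is only an illustration: the dictionary copies the GREEN phase table and stands in for RATIONALIZATION_TABLE.md, whose actual format is not shown here, and `uncovered` is a hypothetical helper name.

```python
# Counters copied from the GREEN phase table; this dict is a stand-in for
# RATIONALIZATION_TABLE.md, whose real format is not shown in this document.
COUNTERS = {
    "Just this once": "Red flags list + Iron Laws",
    "Might need later": "YAGNI decision framework",
    "Already tested": "Verification gates required",
    "Just display": "Data model truth principle",
    "Different pages": "Shared logic principle",
}


def uncovered(observed: list, counters: dict) -> list:
    """Return observed rationalizations with no counter yet (case-insensitive)."""
    known = {key.lower() for key in counters}
    return [r for r in observed if r.lower() not in known]


# A rationalization seen in testing but missing from the table is reported,
# which is the signal to add a counter for it.
print(uncovered(["just this once", "Cleanup later"], COUNTERS))  # ['Cleanup later']
```
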
---
## Verification

### Academic Test
Q: "When can you skip verification gates?"

A: Never. IRON_LAWS.md documents no exceptions.

### Pressure Test
Q: "Code works, tested manually, tests take 30s, deadline approaching. Close now?"

A: Run gates. Deadline ≠ exception. Tests catch what manual testing misses.

### Edge Case Test
Q: "Just documentation change, no code. Run tests?"

A: Run typecheck and lint. Tests may be skipped with a documented reason.

---
## Compliance Rate

| Scenario | Baseline | With Skill |
|----------|----------|------------|
| Abstraction without need | 20% correct | 95% correct |
| Beads bypass | 40% correct | 98% correct |
| Evidence skip | 30% correct | 92% correct |
| Display ≠ data model | 50% correct | 88% correct |
| Duplicate fix | 35% correct | 90% correct |

**Overall improvement: 1.8x to 4.8x better compliance per scenario (mean 35% → 92.6% correct, about 2.6x)**

---
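
The per-scenario gains in the Compliance Rate table above can be recomputed directly. This is a minimal sketch that assumes "improvement" means the ratio of the with-skill correct rate to the baseline correct rate; the document does not define the term, so other readings are possible.

```python
# (baseline % correct, with-skill % correct), copied from the Compliance Rate table.
RATES = {
    "Abstraction without need": (20, 95),
    "Beads bypass": (40, 98),
    "Evidence skip": (30, 92),
    "Display ≠ data model": (50, 88),
    "Duplicate fix": (35, 90),
}


def improvement(baseline: float, with_skill: float) -> float:
    """Ratio of with-skill to baseline correct-choice rate (assumed definition)."""
    return with_skill / baseline


def mean(values: list) -> float:
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


ratios = {name: round(improvement(b, s), 2) for name, (b, s) in RATES.items()}
mean_baseline = mean([b for b, _ in RATES.values()])
mean_with_skill = mean([s for _, s in RATES.values()])

for name, ratio in ratios.items():
    print(f"{name}: {ratio}x")
print(f"mean: {mean_baseline}% -> {mean_with_skill}%")
```
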
## Remaining Risks

1. **New rationalizations** - Monitor agent behavior and add new ones to the rationalization table
2. **Extreme time pressure** - Authority language helps but is not guaranteed to hold
3. **Multiple pressures combined** - The hardest case; it requires all counters

## Maintenance

- Monitor agent behavior for new rationalizations
- Add counters to RATIONALIZATION_TABLE.md
- Update red flags list as patterns emerge
- Re-test when significant changes are made to the skill