docs: agent design principles — determinism/judgment split (#240)

Design principle for all disinto agents. ## Core idea Split every agent into two layers: - **Bash orchestrator (thin, deterministic):** session lifecycle, worktrees, locks, phase monitoring - **Claude via formula (fat, judgment):** understand task, implement, handle reviews/CI/merge, adapt to novel situations ## Why Agent scripts grow by accretion — every lesson becomes another if/else in bash. Formulas are refineable, learnable, and generalizable. Bash state machines are not. ## Includes - Design principle with clear split criteria - "When reviewing, ask these questions" checklist - Current state assessment for all 5 agent types - Risk mitigations (phase protocol as safety net) Reviewers and planner should be aware of this principle when assessing PRs and planning work. Co-authored-by: openhands <openhands@all-hands.dev> Reviewed-on: https://codeberg.org/johba/disinto/pulls/240 Reviewed-by: Disinto_bot <disinto_bot@noreply.codeberg.org>
2026-03-19 09:56:37 +01:00 · 2026-03-19 09:56:37 +01:00 · b9ba5c9250
commit b9ba5c9250
parent a4877b58db
1 changed files with 110 additions and 0 deletions
--- a/docs/AGENT-DESIGN.md
+++ b/docs/AGENT-DESIGN.md
@ -0,0 +1,110 @@
+# Agent Design Principles
+
+> **Status:** Active design principle. All agents, reviewers, and planners should follow this.
+
+## The Determinism / Judgment Split
+
+Every agent has two kinds of work. The architecture should separate them cleanly.
+
+### Deterministic (bash orchestrator)
+
+Mechanical operations that always work the same way. These belong in bash scripts:
+
+- Create and destroy tmux sessions
+- Create and destroy git worktrees
+- Phase file watching (the event loop)
+- Lock files and concurrency guards
+- Environment setup and teardown
+- Session lifecycle (start, monitor, kill)
+
+**Properties:** No judgment required. Never fails differently based on interpretation. Easy to test. Hard to break.
+
+### Judgment (Claude via formula)
+
+Operations that require understanding context, making decisions, or adapting to novel situations. These belong in the formula — the prompt Claude executes inside the tmux session:
+
+- Read and understand the task (fetch issue body + comments, parse intent)
+- Assess dependencies ("does the code this depends on actually exist?")
+- Implement the solution
+- Create PR with meaningful title and description
+- Read review feedback, decide what to address vs push back on
+- Handle CI failures (read logs, decide: fix, retry, or escalate)
+- Choose rebase strategy (rebase, merge, or start over)
+- Decide when to refuse vs implement
+
+**Properties:** Benefits from context. Improves when the formula is refined. Adapts to novel situations without new bash code.
+
+## Why This Matters
+
+### Today's problem
+
+Agent scripts grow by accretion. Every new lesson becomes another `if/elif/else` in bash:
+- "CI failed with this pattern → retry with this flag"
+- "Review comment mentions X → rebase before addressing"
+- "Merge conflict in this file → apply this strategy"
+
+This makes agents brittle, hard to modify, and impossible to generalize across projects.
+
+### The alternative
+
+A thin bash orchestrator handles session lifecycle. Everything that requires judgment lives in the formula — a structured prompt that Claude interprets. Learnings become formula refinements, not bash patches.
+
+```
+┌─────────────────────────────────────────┐
+│ Bash orchestrator (thin, deterministic) │
+│                                         │
+│  - tmux session lifecycle               │
+│  - worktree create/destroy              │
+│  - phase file monitoring                │
+│  - lock files                           │
+│  - environment setup                    │
+└────────────────┬────────────────────────┘
+                 │ inject formula
+                 ▼
+┌─────────────────────────────────────────┐
+│ Claude in tmux (fat formula, judgment)  │
+│                                         │
+│  - fetch issue + comments               │
+│  - understand task                      │
+│  - assess dependencies                  │
+│  - implement                            │
+│  - create PR                            │
+│  - handle review feedback               │
+│  - handle CI failures                   │
+│  - rebase, merge, or escalate           │
+└─────────────────────────────────────────┘
+```
+
+### Benefits
+
+- **Adaptive:** Formula refinements propagate instantly. No bash deploy needed.
+- **Learnable:** When an agent handles a new situation well, capture it in the formula.
+- **Debuggable:** Formula steps are human-readable. Bash state machines are not.
+- **Generalizable:** Same orchestrator, different formulas for different agents.
+
+### Risks and mitigations
+
+- **Fragility:** Claude might misinterpret a formula step → Phase protocol is the safety net. No phase signal = stall detected = supervisor escalates.
+- **Cost:** More Claude turns = more tokens → Offset by eliminating bash dead-ends that waste whole sessions.
+- **Non-determinism:** Same formula might produce different results → Success criteria in each step make pass/fail unambiguous.
+
+## Applying This Principle
+
+When reviewing PRs or designing new agents, ask:
+
+1. **Does this bash code make a judgment call?** → Move it to the formula.
+2. **Does this formula step do something mechanical?** → Move it to the orchestrator.
+3. **Is a new `if/else` being added to handle an edge case?** → That's a formula learning, not an orchestrator feature.
+4. **Can this agent's bash be reused by another agent type?** → Good sign — the orchestrator is properly thin.
+
+## Current State
+
+| Agent | Lines | Judgment in bash | Target |
+|-------|-------|------------------|--------|
+| dev-agent | 1380 (agent 732 + phase-handler 648) | Heavy — deps, CI retry, review parsing, merge strategy, recovery mode | Thin orchestrator + formula |
+| review-agent | 870 | Heavy — diff analysis, review decision, approve/request-changes logic | Needs assessment |
+| supervisor | 877 | Heavy — multi-project health checks, CI stall detection, container monitoring | Partially justified (monitoring is deterministic, but escalation decisions are judgment) |
+| gardener | 1242 (agent 471 + poll 771) | Medium — backlog triage, duplicate detection, tech-debt scoring | Poll is heavy orchestration; agent is prompt-driven |
+| vault | 442 (4 scripts) | Medium — approval flow, human gate decisions | Intentionally bash-heavy (security gate should be deterministic) |
+| planner | 382 | Medium — AGENTS.md update, gap analysis | Migrating to tmux+formula (#232) |
+| action-agent | 192 | Light — formula execution | Close to target |