From b9ba5c925067fecbadbb3543d715ae559a893aa0 Mon Sep 17 00:00:00 2001
From: johba
Date: Thu, 19 Mar 2026 09:56:37 +0100
Subject: [PATCH] =?UTF-8?q?docs:=20agent=20design=20principles=20=E2=80=94?=
 =?UTF-8?q?=20determinism/judgment=20split=20(#240)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Design principle for all disinto agents.

## Core idea

Split every agent into two layers:
- **Bash orchestrator (thin, deterministic):** session lifecycle, worktrees, locks, phase monitoring
- **Claude via formula (fat, judgment):** understand task, implement, handle reviews/CI/merge, adapt to novel situations

## Why

Agent scripts grow by accretion — every lesson becomes another if/else in bash. Formulas are refinable, learnable, and generalizable. Bash state machines are not.

## Includes

- Design principle with clear split criteria
- "When reviewing, ask these questions" checklist
- Current state assessment for all 5 agent types
- Risk mitigations (phase protocol as safety net)

Reviewers and planner should be aware of this principle when assessing PRs and planning work.

Co-authored-by: openhands
Reviewed-on: https://codeberg.org/johba/disinto/pulls/240
Reviewed-by: Disinto_bot
---
 docs/AGENT-DESIGN.md | 110 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 110 insertions(+)
 create mode 100644 docs/AGENT-DESIGN.md

diff --git a/docs/AGENT-DESIGN.md b/docs/AGENT-DESIGN.md
new file mode 100644
index 0000000..8d204a2
--- /dev/null
+++ b/docs/AGENT-DESIGN.md
@@ -0,0 +1,110 @@
+# Agent Design Principles
+
+> **Status:** Active design principle. All agents, reviewers, and planners should follow this.
+
+## The Determinism / Judgment Split
+
+Every agent has two kinds of work. The architecture should separate them cleanly.
+
+### Deterministic (bash orchestrator)
+
+Mechanical operations that always work the same way. These belong in bash scripts:
+
+- Create and destroy tmux sessions
+- Create and destroy git worktrees
+- Phase file watching (the event loop)
+- Lock files and concurrency guards
+- Environment setup and teardown
+- Session lifecycle (start, monitor, kill)
+
+**Properties:** No judgment required. Never fails differently based on interpretation. Easy to test. Hard to break.
+
+### Judgment (Claude via formula)
+
+Operations that require understanding context, making decisions, or adapting to novel situations. These belong in the formula — the prompt Claude executes inside the tmux session:
+
+- Read and understand the task (fetch issue body + comments, parse intent)
+- Assess dependencies ("does the code this depends on actually exist?")
+- Implement the solution
+- Create PR with meaningful title and description
+- Read review feedback, decide what to address vs push back on
+- Handle CI failures (read logs, decide: fix, retry, or escalate)
+- Choose rebase strategy (rebase, merge, or start over)
+- Decide when to refuse vs implement
+
+**Properties:** Benefits from context. Improves when the formula is refined. Adapts to novel situations without new bash code.
+
+## Why This Matters
+
+### Today's problem
+
+Agent scripts grow by accretion. Every new lesson becomes another `if/elif/else` in bash:
+- "CI failed with this pattern → retry with this flag"
+- "Review comment mentions X → rebase before addressing"
+- "Merge conflict in this file → apply this strategy"
+
+This makes agents brittle, hard to modify, and impossible to generalize across projects.
+
+### The alternative
+
+A thin bash orchestrator handles session lifecycle. Everything that requires judgment lives in the formula — a structured prompt that Claude interprets. Learnings become formula refinements, not bash patches.
+
+```
+┌─────────────────────────────────────────┐
+│ Bash orchestrator (thin, deterministic) │
+│                                         │
+│ - tmux session lifecycle                │
+│ - worktree create/destroy               │
+│ - phase file monitoring                 │
+│ - lock files                            │
+│ - environment setup                     │
+└────────────────┬────────────────────────┘
+                 │ inject formula
+                 ▼
+┌─────────────────────────────────────────┐
+│ Claude in tmux (fat formula, judgment)  │
+│                                         │
+│ - fetch issue + comments                │
+│ - understand task                       │
+│ - assess dependencies                   │
+│ - implement                             │
+│ - create PR                             │
+│ - handle review feedback                │
+│ - handle CI failures                    │
+│ - rebase, merge, or escalate            │
+└─────────────────────────────────────────┘
+```
+
+### Benefits
+
+- **Adaptive:** Formula refinements propagate instantly. No bash deploy needed.
+- **Learnable:** When an agent handles a new situation well, capture it in the formula.
+- **Debuggable:** Formula steps are human-readable. Bash state machines are not.
+- **Generalizable:** Same orchestrator, different formulas for different agents.
+
+### Risks and mitigations
+
+- **Fragility:** Claude might misinterpret a formula step → Phase protocol is the safety net. No phase signal = stall detected = supervisor escalates.
+- **Cost:** More Claude turns = more tokens → Offset by eliminating bash dead-ends that waste whole sessions.
+- **Non-determinism:** Same formula might produce different results → Success criteria in each step make pass/fail unambiguous.
+
+## Applying This Principle
+
+When reviewing PRs or designing new agents, ask:
+
+1. **Does this bash code make a judgment call?** → Move it to the formula.
+2. **Does this formula step do something mechanical?** → Move it to the orchestrator.
+3. **Is a new `if/else` being added to handle an edge case?** → That's a formula learning, not an orchestrator feature.
+4. **Can this agent's bash be reused by another agent type?** → Good sign — the orchestrator is properly thin.
+
+## Current State
+
+| Agent | Lines | Judgment in bash | Target |
+|-------|-------|------------------|--------|
+| dev-agent | 1380 (agent 732 + phase-handler 648) | Heavy — deps, CI retry, review parsing, merge strategy, recovery mode | Thin orchestrator + formula |
+| review-agent | 870 | Heavy — diff analysis, review decision, approve/request-changes logic | Needs assessment |
+| supervisor | 877 | Heavy — multi-project health checks, CI stall detection, container monitoring | Partially justified (monitoring is deterministic, but escalation decisions are judgment) |
+| gardener | 1242 (agent 471 + poll 771) | Medium — backlog triage, duplicate detection, tech-debt scoring | Poll is heavy orchestration; agent is prompt-driven |
+| vault | 442 (4 scripts) | Medium — approval flow, human gate decisions | Intentionally bash-heavy (security gate should be deterministic) |
+| planner | 382 | Medium — AGENTS.md update, gap analysis | Migrating to tmux+formula (#232) |
+| action-agent | 192 | Light — formula execution | Close to target |
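
Reviewer note, not part of the patch: the deterministic layer described in `docs/AGENT-DESIGN.md` above could look roughly like the sketch below. It is an illustration under assumptions, not the disinto implementation. The function names (`acquire_lock`, `release_lock`, `wait_for_phase`, `run_guarded`), the file paths, and the `PHASE:` strings are all hypothetical. It covers a lock guard, a phase-file poll that reports a stall when no new signal arrives, and a wrapper that ties them together without making any judgment calls.

```shell
#!/usr/bin/env bash
# Sketch only. Function names, paths, and phase strings are hypothetical
# and are not taken from the disinto scripts.

# Take a lock atomically: mkdir fails if the directory already exists,
# so two orchestrators cannot both succeed.
acquire_lock() {
  mkdir "$1" 2>/dev/null
}

# Drop the lock taken by acquire_lock.
release_lock() {
  rmdir "$1" 2>/dev/null
}

# Poll a phase file until its last line differs from the last phase seen.
# Prints the new phase and returns 0, or returns 1 when nothing changes
# within timeout_secs (a stall for the supervisor to escalate).
# interval must be a positive integer number of seconds.
wait_for_phase() {
  local phase_file="$1"
  local last_phase="$2"
  local timeout_secs="$3"
  local interval="${4:-1}"
  local waited=0
  local current
  while :; do
    if [ -s "$phase_file" ]; then
      current="$(tail -n 1 "$phase_file")"
      if [ "$current" != "$last_phase" ]; then
        printf '%s\n' "$current"
        return 0
      fi
    fi
    if [ "$waited" -ge "$timeout_secs" ]; then
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}

# Run a launcher command under the lock and wait for the first phase signal.
# Prints the phase on success, "stall" on timeout, "busy" if the lock is held.
# The launcher (remaining arguments) stands in for starting the tmux session.
run_guarded() {
  local lock_dir="$1"
  local phase_file="$2"
  local timeout_secs="$3"
  local rc=0
  shift 3
  if ! acquire_lock "$lock_dir"; then
    echo "busy"
    return 2
  fi
  "$@"
  if ! wait_for_phase "$phase_file" "" "$timeout_secs"; then
    echo "stall"
    rc=1
  fi
  release_lock "$lock_dir"
  return "$rc"
}
```

Nothing in this sketch inspects why a phase was written or how to react to it. In the split the document proposes, that interpretation would stay in the formula, and the bash side only reports `busy`, `stall`, or the phase string it saw.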