
# Agent Design Principles

> **Status:** Active design principle. All agents, reviewers, and planners should follow this.

## The Determinism / Judgment Split

Every agent has two kinds of work. The architecture should separate them cleanly.

### Deterministic (bash orchestrator)

Mechanical operations that always work the same way. These belong in bash scripts:

- Create and destroy tmux sessions
- Create and destroy git worktrees
- Phase file watching (the event loop)
- Lock files and concurrency guards
- Environment setup and teardown
- Session lifecycle (start, monitor, kill)

**Properties:** No judgment required. Never fails differently based on interpretation. Easy to test. Hard to break.
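
As a minimal sketch of what such an orchestrator might contain, assuming `git` and `tmux` are installed: the function names, the `.lock` suffix, and the argument order below are hypothetical and are not taken from disinto's scripts. The lock uses `mkdir` because directory creation is atomic.

```bash
# Hypothetical thin-orchestrator helpers; names and paths are illustrative.

# Acquire a lock by creating a directory. mkdir is atomic, so only one
# caller can succeed. Returns non-zero if the lock is already held.
acquire_lock() {
  mkdir "$1" 2>/dev/null
}

# Release the lock. Safe to call when the lock is not held.
release_lock() {
  rmdir "$1" 2>/dev/null || true
}

# Tear down in reverse order. Every step tolerates a missing resource.
# $1 = session name, $2 = repo dir, $3 = worktree path
stop_session() {
  tmux kill-session -t "$1" 2>/dev/null || true
  git -C "$2" worktree remove --force "$3" 2>/dev/null || true
  release_lock "$3.lock"
}

# Start one agent session: lock, worktree, detached tmux session.
# $1 = session name, $2 = repo dir, $3 = worktree path, $4 = new branch
start_session() {
  acquire_lock "$3.lock" || { echo "already running: $1" >&2; return 1; }
  if git -C "$2" worktree add -b "$4" "$3" &&
     tmux new-session -d -s "$1" -c "$3"; then
    return 0
  fi
  stop_session "$1" "$2" "$3"   # roll back whatever was created
  return 1
}
```

Nothing here inspects the task, the diff, or the CI output. The only branches are success or failure of a mechanical step.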

### Judgment (Claude via formula)

Operations that require understanding context, making decisions, or adapting to novel situations. These belong in the formula, the prompt Claude executes inside the tmux session (or, for single-pass decisions, via `claude -p`):

- Read and understand the task (fetch issue body + comments, parse intent)
- Assess dependencies ("does the code this depends on actually exist?")
- Implement the solution
- Create PR with meaningful title and description
- Read review feedback, decide what to address vs push back on
- Handle CI failures (read logs, decide: fix, retry, or escalate)
- Choose rebase strategy (rebase, merge, or start over)
- Decide when to refuse vs implement

**Properties:** Benefits from context. Improves when the formula is refined. Adapts to novel situations without new bash code.

## Why This Matters

### Today's problem

Agent scripts grow by accretion. Every new lesson becomes another `if/elif/else` in bash:
- "CI failed with this pattern → retry with this flag"
- "Review comment mentions X → rebase before addressing"
- "Merge conflict in this file → apply this strategy"

This makes agents brittle, hard to modify, and impossible to generalize across projects.
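
To make the accretion pattern concrete, here is a hypothetical sketch of the shape such a handler tends to take. The log patterns and action names are invented for illustration and are not copied from any disinto script.

```bash
# Hypothetical example of judgment encoded in bash.
# Each branch was once a "lesson learned"; patterns and actions are invented.

# True when string $1 contains substring $2.
log_has() {
  case "$1" in
    *"$2"*) return 0 ;;
    *)      return 1 ;;
  esac
}

# Decide what to do about a CI failure from its log text ($1).
decide_ci_action() {
  if log_has "$1" "connection reset"; then
    echo retry
  elif log_has "$1" "rate limit"; then
    echo retry-with-backoff
  elif log_has "$1" "CONFLICT"; then
    echo rebase
  else
    echo escalate   # anything nobody anticipated lands here
  fi
}
```

Every branch is a substring match standing in for a decision. A failure that nobody anticipated either falls through to `escalate` or, worse, matches an unrelated branch by accident.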

### The alternative

A thin bash orchestrator handles session lifecycle. Everything that requires judgment lives in the formula — a structured prompt that Claude interprets. Learnings become formula refinements, not bash patches.

```
┌─────────────────────────────────────────┐
│ Bash orchestrator (thin, deterministic) │
│                                         │
│  - tmux session lifecycle               │
│  - worktree create/destroy              │
│  - phase file monitoring                │
│  - lock files                           │
│  - environment setup                    │
└────────────────┬────────────────────────┘
                 │ inject formula / invoke claude -p
                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                      Judgment layer                             │
│                                                                 │
│  ┌─────────────────────────────┐ ┌───────────────────────────┐  │
│  │ Claude in tmux (interactive)│ │ claude -p (one-shot)      │  │
│  │                             │ │                           │  │
│  │  Multi-turn sessions with   │ │  Single prompt+response.  │  │
│  │  phase protocol, CI/review  │ │  No persistent session.   │  │
│  │  feedback loops, tool use.  │ │  Suited for classify-and- │  │
│  │                             │ │  route decisions that     │  │
│  │  Used by: dev, review,      │ │  don't need interaction.  │  │
│  │  gardener, action, planner, │ │                           │  │
│  │  predictor, supervisor      │ │  Used by: vault           │  │
│  └─────────────────────────────┘ └───────────────────────────┘  │
│                                                                 │
│  Both patterns keep judgment out of bash. Choose based on       │
│  whether the agent needs multi-turn interaction (tmux) or       │
│  a single classify/decide pass (claude -p).                     │
└─────────────────────────────────────────────────────────────────┘
```
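
The two judgment patterns in the diagram could be dispatched roughly as follows. This is a sketch under assumptions: the agent-to-pattern mapping comes from the diagram, but the function names, the formula-file argument, and the use of a tmux paste buffer to inject the formula are hypothetical. It assumes the `claude` CLI and `tmux` are on `PATH`.

```bash
# Hypothetical dispatch between the two judgment patterns.

# Map an agent name to its pattern, following the diagram above.
judgment_mode() {
  case "$1" in
    dev|review|gardener|action|planner|predictor|supervisor) echo tmux ;;
    vault) echo one-shot ;;
    *) echo "unknown agent: $1" >&2; return 1 ;;
  esac
}

# Interactive: start Claude in a detached tmux session, then paste the
# formula into it. A real orchestrator would wait until Claude is ready
# for input before pasting; that wait is omitted here.
# $1 = session name, $2 = formula file
run_interactive() {
  tmux new-session -d -s "$1" claude
  tmux load-buffer "$2"
  tmux paste-buffer -d -t "$1"
  tmux send-keys -t "$1" Enter
}

# One-shot: a single prompt and response, with the result on stdout.
# $1 = formula file
run_one_shot() {
  claude -p "$(cat "$1")"
}
```

In both cases bash only decides how Claude is invoked, never what Claude should conclude.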

### Benefits

- **Adaptive:** Formula refinements propagate instantly. No bash deploy needed.
- **Learnable:** When an agent handles a new situation well, capture it in the formula.
- **Debuggable:** Formula steps are human-readable. Bash state machines are not.
- **Generalizable:** Same orchestrator, different formulas for different agents.

### Risks and mitigations

- **Fragility:** Claude might misinterpret a formula step → Phase protocol is the safety net. No phase signal = stall detected = supervisor escalates.
- **Cost:** More Claude turns = more tokens → Offset by eliminating bash dead-ends that waste whole sessions.
- **Non-determinism:** Same formula might produce different results → Success criteria in each step make pass/fail unambiguous.
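
The stall-detection safety net can stay fully deterministic. The sketch below assumes the phase signal is a file whose modification time marks the last signal; that convention, the function names, and the timeout are assumptions made for illustration, not a description of the actual phase protocol.

```bash
# Hypothetical stall check built on a phase file's modification time.

# Pure decision: stalled when the last signal is older than the timeout.
# $1 = last signal (epoch seconds), $2 = now (epoch seconds), $3 = timeout (seconds)
is_stalled() {
  [ $(( $2 - $1 )) -gt "$3" ]
}

# Modification time of a file in epoch seconds.
# Tries GNU stat first, then falls back to the BSD/macOS form.
phase_mtime() {
  stat -c %Y "$1" 2>/dev/null || stat -f %m "$1"
}

# One iteration of the watch loop. A missing phase file counts as
# "no phase signal". What to do about a stall is left to the supervisor.
# $1 = phase file, $2 = timeout (seconds)
check_phase() {
  if [ ! -e "$1" ] || is_stalled "$(phase_mtime "$1")" "$(date +%s)" "$2"; then
    echo stalled
  else
    echo active
  fi
}
```

The check only reports `stalled` or `active`. Choosing between retrying, restarting, or asking a human remains an escalation decision.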

## Applying This Principle

When reviewing PRs or designing new agents, ask:

1. **Does this bash code make a judgment call?** → Move it to the formula.
2. **Does this formula step do something mechanical?** → Move it to the orchestrator.
3. **Is a new `if/else` being added to handle an edge case?** → That's a formula learning, not an orchestrator feature.
4. **Can this agent's bash be reused by another agent type?** → Good sign — the orchestrator is properly thin.
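
Question 3 can be prompted mechanically during review. The helper below is a hypothetical heuristic, not an existing disinto tool: it counts branch keywords on the added lines of a unified diff read from stdin. A non-zero count is a cue to ask the question, not proof of a violation.

```bash
# Hypothetical review helper: count added lines in a unified diff that
# open a new branch (if / elif / case). Reads the diff from stdin.
count_added_branches() {
  # grep -c prints the count but exits 1 on zero matches, hence "|| true".
  grep -cE '^\+[[:space:]]*(if|elif|case)[[:space:]]' || true
}
```

A reviewer might run it as `git diff main -- '*.sh' | count_added_branches`, where `main` stands in for whichever base branch applies.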

## Current State

| Agent | Bash lines | Judgment in bash | Target / assessment |
|-------|-------|------------------|--------|
| dev-agent | 2246 (agent 791 + phase-handler 786 + dev-poll 669) | Heavy — deps, CI retry, review parsing, merge strategy, recovery mode; dev-poll adds dependency resolution, CI retry tracking, approved-PR merging, orphaned session recovery | Thin orchestrator + formula |
| review-agent | 870 | Heavy — diff analysis, review decision, approve/request-changes logic | Needs assessment |
| supervisor | 877 | Heavy — multi-project health checks, CI stall detection, container monitoring | Partially justified (monitoring is deterministic, but escalation decisions are judgment) |
| gardener | 1242 (agent 471 + poll 771) | Medium — backlog triage, duplicate detection, tech-debt scoring | Poll is heavy orchestration; agent is prompt-driven |
| vault | 442 (4 scripts) | Medium — approval flow, human gate decisions | Intentionally bash-heavy (security gate should be deterministic) |
| planner | 382 | Medium — AGENTS.md update, gap analysis | Tmux+formula (done, #232) |