johba b9ba5c9250 docs: agent design principles — determinism/judgment split (#240 )

Design principle for all disinto agents.

## Core idea
Split every agent into two layers:
- **Bash orchestrator (thin, deterministic):** session lifecycle, worktrees, locks, phase monitoring
- **Claude via formula (fat, judgment):** understand task, implement, handle reviews/CI/merge, adapt to novel situations

## Why
Agent scripts grow by accretion: every lesson becomes another if/else in bash. Formulas are refinable, learnable, and generalizable. Bash state machines are not.

## Includes
- Design principle with clear split criteria
- "When reviewing, ask these questions" checklist
- Current state assessment for each agent type
- Risk mitigations (phase protocol as safety net)

Reviewers and planner should be aware of this principle when assessing PRs and planning work.

Co-authored-by: openhands <openhands@all-hands.dev>
Reviewed-on: https://codeberg.org/johba/disinto/pulls/240
Reviewed-by: Disinto_bot <disinto_bot@noreply.codeberg.org>

2026-03-19 09:56:37 +01:00

5.7 KiB


Agent Design Principles

Status: Active design principle. All agents, reviewers, and planners should follow this.

The Determinism / Judgment Split

Every agent has two kinds of work. The architecture should separate them cleanly.

Deterministic (bash orchestrator)

Mechanical operations that always work the same way. These belong in bash scripts:

Create and destroy tmux sessions
Create and destroy git worktrees
Phase file watching (the event loop)
Lock files and concurrency guards
Environment setup and teardown
Session lifecycle (start, monitor, kill)

Properties: No judgment required. Behaves the same way on every run, with no room for interpretation. Easy to test. Hard to break.
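
As an illustration of the "lock files and concurrency guards" item, here is a minimal sketch of a guard built on `mkdir`, which either creates the directory or fails in one atomic step. The function names and the lock path layout are assumptions made for this example; they are not taken from the disinto scripts.

```shell
#!/usr/bin/env bash
# Hypothetical lock helpers; names and paths are illustrative assumptions.

# acquire_lock DIR: succeed only if DIR did not exist yet.
# mkdir is atomic, so two concurrent callers cannot both succeed.
acquire_lock() {
  local lock_dir="$1"
  if mkdir "$lock_dir" 2>/dev/null; then
    echo "$$" > "$lock_dir/pid"   # record the owner for debugging
    return 0
  fi
  return 1
}

# release_lock DIR: remove the lock; safe to call when it is already gone.
release_lock() {
  local lock_dir="$1"
  [ -n "$lock_dir" ] || return 0  # never run rm on an empty path
  rm -rf -- "$lock_dir"
}
```

A caller would typically acquire the lock first, register `release_lock` in an `EXIT` trap, and exit quietly when the lock is already held. Nothing in this code depends on what the agent is doing, which is what makes it orchestrator material.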

Judgment (Claude via formula)

Operations that require understanding context, making decisions, or adapting to novel situations. These belong in the formula — the prompt Claude executes inside the tmux session:

Read and understand the task (fetch issue body + comments, parse intent)
Assess dependencies ("does the code this depends on actually exist?")
Implement the solution
Create PR with meaningful title and description
Read review feedback, decide what to address vs push back on
Handle CI failures (read logs, decide: fix, retry, or escalate)
Choose rebase strategy (rebase, merge, or start over)
Decide when to refuse vs implement

Properties: Benefits from context. Improves when the formula is refined. Adapts to novel situations without new bash code.

Why This Matters

Today's problem

Agent scripts grow by accretion. Every new lesson becomes another if/elif/else in bash:

"CI failed with this pattern → retry with this flag"
"Review comment mentions X → rebase before addressing"
"Merge conflict in this file → apply this strategy"

This makes agents brittle, hard to modify, and impossible to generalize across projects.
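
A hypothetical illustration of that accretion follows. The log patterns, action names, and function name are invented for the example and do not come from the disinto scripts.

```shell
#!/usr/bin/env bash
# Anti-pattern sketch: judgment encoded as string matching in bash.
# Every branch below is an invented example of a "lesson" that was
# patched in after one incident.

decide_ci_action() {
  local log="$1"
  if [[ "$log" == *"connection reset"* ]]; then
    echo "retry"                 # lesson 1: flaky network
  elif [[ "$log" == *"out of disk"* ]]; then
    echo "clean-and-retry"       # lesson 2: runner disk filled up
  elif [[ "$log" == *"snapshot mismatch"* ]]; then
    echo "regenerate-snapshots"  # lesson 3: one project's test setup
  else
    echo "escalate"              # everything nobody has seen yet
  fi
}
```

Each branch only recognizes a failure that has already happened once, and the third is specific to a single project. Any failure that does not match a known substring falls through to escalation, however simple the fix might be.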

The alternative

A thin bash orchestrator handles session lifecycle. Everything that requires judgment lives in the formula — a structured prompt that Claude interprets. Learnings become formula refinements, not bash patches.

┌─────────────────────────────────────────┐
│ Bash orchestrator (thin, deterministic) │
│                                         │
│  - tmux session lifecycle               │
│  - worktree create/destroy              │
│  - phase file monitoring                │
│  - lock files                           │
│  - environment setup                    │
└────────────────┬────────────────────────┘
                 │ inject formula
                 ▼
┌─────────────────────────────────────────┐
│ Claude in tmux (fat formula, judgment)  │
│                                         │
│  - fetch issue + comments               │
│  - understand task                      │
│  - assess dependencies                  │
│  - implement                            │
│  - create PR                            │
│  - handle review feedback               │
│  - handle CI failures                   │
│  - rebase, merge, or escalate           │
└─────────────────────────────────────────┘
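
The top box of the diagram could be sketched roughly as below. This is a sketch under stated assumptions, not the disinto orchestrator: `run`, `start_agent`, `stop_agent`, the `DRY_RUN` switch, and the `run-formula` placeholder command are all hypothetical names. Only the `git worktree` and `tmux` subcommands are real.

```shell
#!/usr/bin/env bash
# Hypothetical thin orchestrator. Every name here (run, start_agent,
# stop_agent, DRY_RUN, run-formula) is an assumption for illustration.

# run CMD...: execute the command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# start_agent NAME BRANCH WORKTREE FORMULA_FILE
# Creates the worktree and a detached tmux session, then hands the
# formula path to whatever runs inside the session.
start_agent() {
  local name="$1" branch="$2" worktree="$3" formula="$4"
  run git worktree add -b "$branch" "$worktree"
  run tmux new-session -d -s "$name" -c "$worktree"
  # How the formula reaches Claude is not specified in this document;
  # this sketch just types a placeholder command into the session.
  run tmux send-keys -t "$name" "run-formula $formula" Enter
}

# stop_agent NAME WORKTREE: tear down in reverse order of creation.
stop_agent() {
  local name="$1" worktree="$2"
  run tmux kill-session -t "$name"
  run git worktree remove --force "$worktree"
}
```

With `DRY_RUN=1` the functions only print the commands they would run, which makes the orchestrator easy to test without tmux or a repository. The sketch contains no branch that depends on what the task is or how it is going; that is the property the split is aiming for.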

Benefits

Adaptive: Formula refinements propagate instantly. No bash deploy needed.
Learnable: When an agent handles a new situation well, capture it in the formula.
Debuggable: Formula steps are human-readable. Bash state machines are not.
Generalizable: Same orchestrator, different formulas for different agents.

Risks and mitigations

Fragility: Claude might misinterpret a formula step → Phase protocol is the safety net. No phase signal = stall detected = supervisor escalates.
Cost: More Claude turns = more tokens → Offset by eliminating bash dead-ends that waste whole sessions.
Non-determinism: Same formula might produce different results → Success criteria in each step make pass/fail unambiguous.
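
The "no phase signal = stall detected" rule could be checked along these lines. This is a minimal sketch that assumes the agent signals progress by rewriting a phase file, so the file's modification time stands in for the time of the last signal. The function names and the timeout argument are hypothetical; the actual phase protocol is not described here.

```shell
#!/usr/bin/env bash
# Hypothetical stall check. Assumes the agent signals progress by
# rewriting a phase file, so the file's mtime is the last signal time.

# file_mtime FILE: print the modification time as epoch seconds.
# Tries GNU stat first, then the BSD/macOS form.
file_mtime() {
  stat -c %Y "$1" 2>/dev/null || stat -f %m "$1" 2>/dev/null
}

# phase_is_stalled FILE TIMEOUT_SECONDS
# Returns 0 (stalled) when the file is missing or older than the
# timeout, 1 when a signal arrived recently enough.
phase_is_stalled() {
  local phase_file="$1" timeout="$2" mtime now
  [ -f "$phase_file" ] || return 0
  mtime="$(file_mtime "$phase_file")" || return 0
  now="$(date +%s)"
  [ $(( now - mtime )) -gt "$timeout" ]
}
```

A supervisor loop could call `phase_is_stalled` on each poll and escalate when it returns 0. The check never inspects what Claude did, only whether a signal arrived, which keeps the safety net deterministic.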

Applying This Principle

When reviewing PRs or designing new agents, ask:

Does this bash code make a judgment call? → Move it to the formula.
Does this formula step do something mechanical? → Move it to the orchestrator.
Is a new if/else being added to handle an edge case? → That's a formula learning, not an orchestrator feature.
Can this agent's bash be reused by another agent type? → Good sign — the orchestrator is properly thin.

Current State

| Agent | Lines | Judgment in bash | Target |
|---|---|---|---|
| dev-agent | 1380 (agent 732 + phase-handler 648) | Heavy: deps, CI retry, review parsing, merge strategy, recovery mode | Thin orchestrator + formula |
| review-agent | 870 | Heavy: diff analysis, review decision, approve/request-changes logic | Needs assessment |
| supervisor | 877 | Heavy: multi-project health checks, CI stall detection, container monitoring | Partially justified (monitoring is deterministic, but escalation decisions are judgment) |
| gardener | 1242 (agent 471 + poll 771) | Medium: backlog triage, duplicate detection, tech-debt scoring | Poll is heavy orchestration; agent is prompt-driven |
| vault | 442 (4 scripts) | Medium: approval flow, human gate decisions | Intentionally bash-heavy (security gate should be deterministic) |
| planner | 382 | Medium: AGENTS.md update, gap analysis | Migrating to tmux+formula (#232) |
| action-agent | 192 | Light: formula execution | Close to target |