disinto/docs/PHASE-PROTOCOL.md
openhands 23949083c0 fix: Remove Matrix integration — notifications move to forge + OpenClaw (#732)
Remove all Matrix/Dendrite infrastructure:
- Delete lib/matrix_listener.sh (long-poll daemon), lib/matrix_listener.service
  (systemd unit), lib/hooks/on-stop-matrix.sh (response streaming hook)
- Remove matrix_send() and matrix_send_ctx() from lib/env.sh
- Remove MATRIX_HOMESERVER auto-detection, MATRIX_THREAD_MAP from lib/env.sh
- Remove [matrix] section parsing from lib/load-project.sh
- Remove Matrix hook installation from lib/agent-session.sh
- Remove notify/notify_ctx helpers and Matrix thread tracking from
  dev/dev-agent.sh and action/action-agent.sh
- Remove all matrix_send calls from dev-poll.sh, phase-handler.sh,
  action-poll.sh, vault-poll.sh, vault-fire.sh, vault-reject.sh,
  review-poll.sh, review-pr.sh, supervisor-poll.sh, formula-session.sh
- Remove Matrix listener startup from docker/agents/entrypoint.sh
- Remove append_dendrite_compose() and setup_matrix() from bin/disinto
- Remove --matrix flag from disinto init
- Clean Matrix references from .env.example, projects/*.toml.example,
  formulas/*.toml, AGENTS.md, BOOTSTRAP.md, README.md, RESOURCES.md,
  PHASE-PROTOCOL.md, and all agent AGENTS.md/PROMPT.md files

Status visibility now via Codeberg PR/issue activity. Human interaction
via vault items through forge. Proactive alerts via OpenClaw heartbeats.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 14:53:56 +00:00

8.4 KiB
Raw Permalink Blame History

Phase-Signaling Protocol for Persistent Claude Sessions

Overview

When dev-agent runs Claude in a persistent tmux session (rather than a one-shot claude -p invocation), Claude needs a way to signal the orchestrator (dev-poll.sh) that a phase has completed.

Claude writes a sentinel line to a phase file — a well-known path based on project name and issue number. The orchestrator watches that file and reacts accordingly.

Phase File Path Convention

/tmp/dev-session-{project}-{issue}.phase

Where:

  • {project} = the project name from the TOML (name field), e.g. harb
  • {issue} = the issue number, e.g. 42

Example: /tmp/dev-session-harb-42.phase

Phase Values

Claude writes exactly one of these lines to the phase file when a phase ends:

Sentinel Meaning Orchestrator action
PHASE:awaiting_ci PR pushed, waiting for CI to run Poll CI; inject result when done
PHASE:awaiting_review CI passed, PR open, waiting for review Wait for review-poll to inject feedback
PHASE:escalate Needs human input (any reason) Send vault/forge notification; session stays alive; 24h timeout → blocked
PHASE:done Work complete, PR merged Verify merge, kill tmux session, clean up
PHASE:failed Unrecoverable failure Escalate to gardener/supervisor

Writing a phase (from within Claude's session)

PHASE_FILE="/tmp/dev-session-${PROJECT_NAME:-project}-${ISSUE:-0}.phase"

# Signal awaiting CI
echo "PHASE:awaiting_ci" > "$PHASE_FILE"

# Signal awaiting review
echo "PHASE:awaiting_review" > "$PHASE_FILE"

# Signal needs human
echo "PHASE:escalate" > "$PHASE_FILE"

# Signal done
echo "PHASE:done" > "$PHASE_FILE"

# Signal failure
echo "PHASE:failed" > "$PHASE_FILE"

The orchestrator reads with:

phase=$(head -1 "$PHASE_FILE" 2>/dev/null | tr -d '[:space:]')

Using head -1 is required: PHASE:failed may have a reason line on line 2, and reading all lines would produce PHASE:failedReason:... which never matches.

Orchestrator Reaction Matrix

PHASE:awaiting_ci     → poll CI every 30s
                         on success  → inject "CI passed" into tmux session
                         on failure  → inject CI error log into tmux session
                         on timeout  → inject "CI timeout" + escalate

PHASE:awaiting_review → wait for review-poll.sh to post review comment
                         on REQUEST_CHANGES → inject review text into session
                         on APPROVE         → inject "approved" into session
                         on timeout (3h)    → inject "no review, escalating"

PHASE:escalate        → send vault/forge notification with context (issue/PR link, reason)
                         session stays alive waiting for human reply
                         on timeout → 24h: label issue blocked, kill session

PHASE:done            → verify PR merged on forge
                         if merged   → kill tmux session, clean labels, close issue
                         if not      → inject "PR not merged yet" into session

PHASE:failed          → label issue blocked, post diagnostic comment
                         kill tmux session
                         restore backlog label on issue

idle_prompt exit reason

monitor_phase_loop (in lib/agent-session.sh) can exit with _MONITOR_LOOP_EXIT=idle_prompt. This happens when Claude returns to the interactive prompt () for 3 consecutive polls without writing any phase signal to the phase file.

Trigger conditions:

  • The phase file is empty (no phase has ever been written), and
  • The Stop-hook idle marker (/tmp/claude-idle-{session}.ts) is present (meaning Claude finished a response), and
  • This state persists across 3 consecutive poll cycles.

Side-effects:

  1. The tmux session is killed before the callback is invoked — callbacks that handle PHASE:failed must not assume the session is alive.
  2. The callback is invoked with PHASE:failed even though the phase file is empty. This is the only situation where PHASE:failed is passed to the callback without the phase file actually containing that value.

Agent requirements:

  • Callback (_on_phase_change / formula_phase_callback): Must handle PHASE:failed defensively — the session is already dead, so any tmux send-keys or session-dependent logic must be skipped or guarded.
  • Post-loop exit handler (case $_MONITOR_LOOP_EXIT): Must include an idle_prompt) branch. Typical actions: log the event, clean up temp files, and (for agents that use escalation) write an escalation entry or notify via vault/forge. See dev/dev-agent.sh, action/action-agent.sh, and gardener/gardener-agent.sh for reference implementations.

Crash Recovery

If the tmux session dies (Claude crash, OOM, kernel OOM-kill, compaction):

Detection

dev-poll.sh detects a crash via:

  1. tmux has-session -t "dev-{project}-{issue}" returns non-zero, OR
  2. Phase file is stale (mtime > CLAUDE_TIMEOUT seconds with no PHASE:done)

Recovery procedure

# 1. Read current state from disk
git_diff=$(git -C "$WORKTREE" diff origin/main..HEAD --stat 2>/dev/null)
last_phase=$(head -1 "$PHASE_FILE" 2>/dev/null | tr -d '[:space:]')
last_phase="${last_phase:-PHASE:unknown}"
last_ci=$(cat "/tmp/ci-result-${PROJECT_NAME}-${ISSUE}.txt" 2>/dev/null || echo "")
review_comments=$(curl -sf ... "${API}/issues/${PR}/comments" | jq ...)

# 2. Spawn new tmux session in same worktree
tmux new-session -d -s "dev-${PROJECT_NAME}-${ISSUE}" \
  -c "$WORKTREE" \
  "claude --dangerously-skip-permissions"

# 3. Inject recovery context
tmux send-keys -t "dev-${PROJECT_NAME}-${ISSUE}" \
  "$(cat recovery-prompt.txt)" Enter

Recovery context injected into new session:

  • Issue body (what to implement)
  • git diff of work done so far (git is the checkpoint, not memory)
  • Last known phase (where we left off)
  • Last CI result (if phase was awaiting_ci)
  • Latest review comments (if phase was awaiting_review)

Key principle: Git is the checkpoint. The worktree persists across crashes. Claude can read git log, git diff, and git status to understand exactly what was done before the crash. No state needs to be stored beyond the phase file and git history.

State files summary

File Created by Purpose
/tmp/dev-session-{proj}-{issue}.phase Claude (in session) Current phase
/tmp/ci-result-{proj}-{issue}.txt Orchestrator Last CI output for injection
/tmp/dev-{proj}-{issue}.log Orchestrator Session transcript (aspirational — path TBD when tmux session manager is implemented in #80)
/tmp/dev-renotify-{proj}-{issue} supervisor-poll.sh Marker to prevent duplicate 6h re-notifications
WORKTREE (git worktree) dev-agent.sh Code checkpoint

Sequence Diagram

Claude session                 Orchestrator (dev-poll.sh)
──────────────                 ──────────────────────────
implement issue
push PR branch
echo "PHASE:awaiting_ci" ───→  read phase file
                               poll CI
                               CI passes
                          ←──  tmux send-keys "CI passed"
echo "PHASE:awaiting_review" → read phase file
                               wait for review-poll
                               review: REQUEST_CHANGES
                          ←──  tmux send-keys "Review: ..."
address review comments
push fixes
echo "PHASE:awaiting_review" → read phase file
                               review: APPROVE
                          ←──  tmux send-keys "Approved"
merge PR
echo "PHASE:done" ──────────→  read phase file
                               verify merged
                               kill session
                               close issue

Notes

  • The phase file is write-once-per-phase (always overwritten with >). The orchestrator reads it, acts, then waits for the next write.
  • Claude should write the phase sentinel as the last action of each phase, after any git push or other side effects are complete.
  • If Claude writes PHASE:failed, it should include a reason on the next line:
    printf 'PHASE:failed\nReason: %s\n' "$reason" > "$PHASE_FILE"
    
  • Phase files are cleaned up by the orchestrator after PHASE:done or PHASE:failed.