disinto/docs/PHASE-PROTOCOL.md
openhands 23949083c0 fix: Remove Matrix integration — notifications move to forge + OpenClaw (#732)
Remove all Matrix/Dendrite infrastructure:
- Delete lib/matrix_listener.sh (long-poll daemon), lib/matrix_listener.service
  (systemd unit), lib/hooks/on-stop-matrix.sh (response streaming hook)
- Remove matrix_send() and matrix_send_ctx() from lib/env.sh
- Remove MATRIX_HOMESERVER auto-detection, MATRIX_THREAD_MAP from lib/env.sh
- Remove [matrix] section parsing from lib/load-project.sh
- Remove Matrix hook installation from lib/agent-session.sh
- Remove notify/notify_ctx helpers and Matrix thread tracking from
  dev/dev-agent.sh and action/action-agent.sh
- Remove all matrix_send calls from dev-poll.sh, phase-handler.sh,
  action-poll.sh, vault-poll.sh, vault-fire.sh, vault-reject.sh,
  review-poll.sh, review-pr.sh, supervisor-poll.sh, formula-session.sh
- Remove Matrix listener startup from docker/agents/entrypoint.sh
- Remove append_dendrite_compose() and setup_matrix() from bin/disinto
- Remove --matrix flag from disinto init
- Clean Matrix references from .env.example, projects/*.toml.example,
  formulas/*.toml, AGENTS.md, BOOTSTRAP.md, README.md, RESOURCES.md,
  PHASE-PROTOCOL.md, and all agent AGENTS.md/PROMPT.md files

Status visibility now via Codeberg PR/issue activity. Human interaction
via vault items through forge. Proactive alerts via OpenClaw heartbeats.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 14:53:56 +00:00

213 lines
8.4 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Phase-Signaling Protocol for Persistent Claude Sessions
## Overview
When dev-agent runs Claude in a persistent tmux session (rather than a
one-shot `claude -p` invocation), Claude needs a way to signal the
orchestrator (`dev-poll.sh`) that a phase has completed.
Claude writes a sentinel line to a **phase file** — a well-known path based
on project name and issue number. The orchestrator watches that file and
reacts accordingly.
## Phase File Path Convention
```
/tmp/dev-session-{project}-{issue}.phase
```
Where:
- `{project}` = the project name from the TOML (`name` field), e.g. `harb`
- `{issue}` = the issue number, e.g. `42`
Example: `/tmp/dev-session-harb-42.phase`
## Phase Values
Claude writes exactly one of these lines to the phase file when a phase ends:
| Sentinel | Meaning | Orchestrator action |
|----------|---------|---------------------|
| `PHASE:awaiting_ci` | PR pushed, waiting for CI to run | Poll CI; inject result when done |
| `PHASE:awaiting_review` | CI passed, PR open, waiting for review | Wait for `review-poll` to inject feedback |
| `PHASE:escalate` | Needs human input (any reason) | Send vault/forge notification; session stays alive; 24h timeout → blocked |
| `PHASE:done` | Work complete, PR merged | Verify merge, kill tmux session, clean up |
| `PHASE:failed` | Unrecoverable failure | Escalate to gardener/supervisor |
### Writing a phase (from within Claude's session)
```bash
PHASE_FILE="/tmp/dev-session-${PROJECT_NAME:-project}-${ISSUE:-0}.phase"
# Signal awaiting CI
echo "PHASE:awaiting_ci" > "$PHASE_FILE"
# Signal awaiting review
echo "PHASE:awaiting_review" > "$PHASE_FILE"
# Signal needs human
echo "PHASE:escalate" > "$PHASE_FILE"
# Signal done
echo "PHASE:done" > "$PHASE_FILE"
# Signal failure
echo "PHASE:failed" > "$PHASE_FILE"
```
The orchestrator reads with:
```bash
phase=$(head -1 "$PHASE_FILE" 2>/dev/null | tr -d '[:space:]')
```
Using `head -1` is required: `PHASE:failed` may have a reason line on line 2,
and reading all lines would produce `PHASE:failedReason:...` which never matches.
## Orchestrator Reaction Matrix
```
PHASE:awaiting_ci → poll CI every 30s
on success → inject "CI passed" into tmux session
on failure → inject CI error log into tmux session
on timeout → inject "CI timeout" + escalate
PHASE:awaiting_review → wait for review-poll.sh to post review comment
on REQUEST_CHANGES → inject review text into session
on APPROVE → inject "approved" into session
on timeout (3h) → inject "no review, escalating"
PHASE:escalate → send vault/forge notification with context (issue/PR link, reason)
session stays alive waiting for human reply
on timeout → 24h: label issue blocked, kill session
PHASE:done → verify PR merged on forge
if merged → kill tmux session, clean labels, close issue
if not → inject "PR not merged yet" into session
PHASE:failed → label issue blocked, post diagnostic comment
kill tmux session
restore backlog label on issue
```
### `idle_prompt` exit reason
`monitor_phase_loop` (in `lib/agent-session.sh`) can exit with
`_MONITOR_LOOP_EXIT=idle_prompt`. This happens when Claude returns to the
interactive prompt (``) for **3 consecutive polls** without writing any phase
signal to the phase file.
**Trigger conditions:**
- The phase file is empty (no phase has ever been written), **and**
- The Stop-hook idle marker (`/tmp/claude-idle-{session}.ts`) is present
(meaning Claude finished a response), **and**
- This state persists across 3 consecutive poll cycles.
**Side-effects:**
1. The tmux session is **killed before** the callback is invoked — callbacks
that handle `PHASE:failed` must not assume the session is alive.
2. The callback is invoked with `PHASE:failed` even though the phase file is
empty. This is the only situation where `PHASE:failed` is passed to the
callback without the phase file actually containing that value.
**Agent requirements:**
- **Callback (`_on_phase_change` / `formula_phase_callback`):** Must handle
`PHASE:failed` defensively — the session is already dead, so any tmux
send-keys or session-dependent logic must be skipped or guarded.
- **Post-loop exit handler (`case $_MONITOR_LOOP_EXIT`):** Must include an
`idle_prompt)` branch. Typical actions: log the event, clean up temp files,
and (for agents that use escalation) write an escalation entry or notify via
vault/forge. See `dev/dev-agent.sh`, `action/action-agent.sh`, and
`gardener/gardener-agent.sh` for reference implementations.
## Crash Recovery
If the tmux session dies (Claude crash, OOM, kernel OOM-kill, compaction):
### Detection
`dev-poll.sh` detects a crash via:
1. `tmux has-session -t "dev-{project}-{issue}"` returns non-zero, OR
2. Phase file is stale (mtime > `CLAUDE_TIMEOUT` seconds with no `PHASE:done`)
### Recovery procedure
```bash
# 1. Read current state from disk
git_diff=$(git -C "$WORKTREE" diff origin/main..HEAD --stat 2>/dev/null)
last_phase=$(head -1 "$PHASE_FILE" 2>/dev/null | tr -d '[:space:]')
last_phase="${last_phase:-PHASE:unknown}"
last_ci=$(cat "/tmp/ci-result-${PROJECT_NAME}-${ISSUE}.txt" 2>/dev/null || echo "")
review_comments=$(curl -sf ... "${API}/issues/${PR}/comments" | jq ...)
# 2. Spawn new tmux session in same worktree
tmux new-session -d -s "dev-${PROJECT_NAME}-${ISSUE}" \
-c "$WORKTREE" \
"claude --dangerously-skip-permissions"
# 3. Inject recovery context
tmux send-keys -t "dev-${PROJECT_NAME}-${ISSUE}" \
"$(cat recovery-prompt.txt)" Enter
```
**Recovery context injected into new session:**
- Issue body (what to implement)
- `git diff` of work done so far (git is the checkpoint, not memory)
- Last known phase (where we left off)
- Last CI result (if phase was `awaiting_ci`)
- Latest review comments (if phase was `awaiting_review`)
**Key principle:** Git is the checkpoint. The worktree persists across crashes.
Claude can read `git log`, `git diff`, and `git status` to understand exactly
what was done before the crash. No state needs to be stored beyond the phase
file and git history.
### State files summary
| File | Created by | Purpose |
|------|-----------|---------|
| `/tmp/dev-session-{proj}-{issue}.phase` | Claude (in session) | Current phase |
| `/tmp/ci-result-{proj}-{issue}.txt` | Orchestrator | Last CI output for injection |
| `/tmp/dev-{proj}-{issue}.log` | Orchestrator | Session transcript (aspirational — path TBD when tmux session manager is implemented in #80) |
| `/tmp/dev-renotify-{proj}-{issue}` | supervisor-poll.sh | Marker to prevent duplicate 6h re-notifications |
| `WORKTREE` (git worktree) | dev-agent.sh | Code checkpoint |
## Sequence Diagram
```
Claude session Orchestrator (dev-poll.sh)
────────────── ──────────────────────────
implement issue
push PR branch
echo "PHASE:awaiting_ci" ───→ read phase file
poll CI
CI passes
←── tmux send-keys "CI passed"
echo "PHASE:awaiting_review" → read phase file
wait for review-poll
review: REQUEST_CHANGES
←── tmux send-keys "Review: ..."
address review comments
push fixes
echo "PHASE:awaiting_review" → read phase file
review: APPROVE
←── tmux send-keys "Approved"
merge PR
echo "PHASE:done" ──────────→ read phase file
verify merged
kill session
close issue
```
## Notes
- The phase file is write-once-per-phase (always overwritten with `>`).
The orchestrator reads it, acts, then waits for the next write.
- Claude should write the phase sentinel **as the last action** of each phase,
after any git push or other side effects are complete.
- If Claude writes `PHASE:failed`, it should include a reason on the next line:
```bash
printf 'PHASE:failed\nReason: %s\n' "$reason" > "$PHASE_FILE"
```
- Phase files are cleaned up by the orchestrator after `PHASE:done` or
`PHASE:failed`.