disinto/docs/PHASE-PROTOCOL.md

183 lines
6.8 KiB
Markdown
Raw Normal View History

# Phase-Signaling Protocol for Persistent Claude Sessions
## Overview
When dev-agent runs Claude in a persistent tmux session (rather than a
one-shot `claude -p` invocation), Claude needs a way to signal the
orchestrator (`dev-poll.sh`) that a phase has completed.
Claude writes a sentinel line to a **phase file** — a well-known path based
on project name and issue number. The orchestrator watches that file and
reacts accordingly.
## Phase File Path Convention
```
/tmp/dev-session-{project}-{issue}.phase
```
Where:
- `{project}` = the project name from the TOML (`name` field), e.g. `harb`
- `{issue}` = the issue number, e.g. `42`
Example: `/tmp/dev-session-harb-42.phase`
## Phase Values
Claude writes exactly one of these lines to the phase file when a phase ends:
| Sentinel | Meaning | Orchestrator action |
|----------|---------|---------------------|
| `PHASE:awaiting_ci` | PR pushed, waiting for CI to run | Poll CI; inject result when done |
| `PHASE:awaiting_review` | CI passed, PR open, waiting for review | Wait for `review-poll` to inject feedback |
| `PHASE:needs_human` | Blocked on human decision | Send Matrix notification; wait for reply |
| `PHASE:done` | Work complete, PR merged | Verify merge, kill tmux session, clean up |
| `PHASE:failed` | Unrecoverable failure | Escalate to gardener/supervisor |
### Writing a phase (from within Claude's session)
```bash
PHASE_FILE="/tmp/dev-session-${PROJECT_NAME:-project}-${ISSUE:-0}.phase"
# Signal awaiting CI
echo "PHASE:awaiting_ci" > "$PHASE_FILE"
# Signal awaiting review
echo "PHASE:awaiting_review" > "$PHASE_FILE"
# Signal needs human
echo "PHASE:needs_human" > "$PHASE_FILE"
# Signal done
echo "PHASE:done" > "$PHASE_FILE"
# Signal failure
echo "PHASE:failed" > "$PHASE_FILE"
```
The orchestrator reads with:
```bash
phase=$(head -1 "$PHASE_FILE" 2>/dev/null | tr -d '[:space:]')
```
Using `head -1` is required: `PHASE:failed` may have a reason line on line 2,
and reading all lines would produce `PHASE:failedReason:...` which never matches.
## Orchestrator Reaction Matrix
```
PHASE:awaiting_ci → poll CI every 30s
on success → inject "CI passed" into tmux session
on failure → inject CI error log into tmux session
on timeout → inject "CI timeout" + escalate
PHASE:awaiting_review → wait for review-poll.sh to post review comment
on REQUEST_CHANGES → inject review text into session
on APPROVE → inject "approved" into session
on timeout (3h) → inject "no review, escalating"
PHASE:needs_human → send Matrix notification with issue/PR link
on reply → inject human reply into session
on timeout → re-notify, then escalate after 24h
PHASE:done → verify PR merged on Codeberg
if merged → kill tmux session, clean labels, close issue
if not → inject "PR not merged yet" into session
PHASE:failed → write escalation to supervisor/escalations-{project}.jsonl
kill tmux session
restore backlog label on issue
```
## Crash Recovery
If the tmux session dies (Claude crash, OOM, kernel OOM-kill, compaction):
### Detection
`dev-poll.sh` detects a crash via:
1. `tmux has-session -t "dev-{project}-{issue}"` returns non-zero, OR
2. Phase file is stale (mtime > `CLAUDE_TIMEOUT` seconds with no `PHASE:done`)
### Recovery procedure
```bash
# 1. Read current state from disk
git_diff=$(git -C "$WORKTREE" diff origin/main..HEAD --stat 2>/dev/null)
last_phase=$(head -1 "$PHASE_FILE" 2>/dev/null | tr -d '[:space:]')
last_phase="${last_phase:-PHASE:unknown}"
last_ci=$(cat "/tmp/ci-result-${PROJECT_NAME}-${ISSUE}.txt" 2>/dev/null || echo "")
review_comments=$(curl -sf ... "${API}/issues/${PR}/comments" | jq ...)
# 2. Spawn new tmux session in same worktree
tmux new-session -d -s "dev-${PROJECT_NAME}-${ISSUE}" \
-c "$WORKTREE" \
"claude --dangerously-skip-permissions"
# 3. Inject recovery context
tmux send-keys -t "dev-${PROJECT_NAME}-${ISSUE}" \
"$(cat recovery-prompt.txt)" Enter
```
**Recovery context injected into new session:**
- Issue body (what to implement)
- `git diff` of work done so far (git is the checkpoint, not memory)
- Last known phase (where we left off)
- Last CI result (if phase was `awaiting_ci`)
- Latest review comments (if phase was `awaiting_review`)
**Key principle:** Git is the checkpoint. The worktree persists across crashes.
Claude can read `git log`, `git diff`, and `git status` to understand exactly
what was done before the crash. No state needs to be stored beyond the phase
file and git history.
### State files summary
| File | Created by | Purpose |
|------|-----------|---------|
| `/tmp/dev-session-{proj}-{issue}.phase` | Claude (in session) | Current phase |
| `/tmp/ci-result-{proj}-{issue}.txt` | Orchestrator | Last CI output for injection |
| `/tmp/dev-{proj}-{issue}.log` | Orchestrator | Session transcript (aspirational — path TBD when tmux session manager is implemented in #80) |
| `WORKTREE` (git worktree) | dev-agent.sh | Code checkpoint |
## Sequence Diagram
```
Claude session Orchestrator (dev-poll.sh)
────────────── ──────────────────────────
implement issue
push PR branch
echo "PHASE:awaiting_ci" ───→ read phase file
poll CI
CI passes
←── tmux send-keys "CI passed"
echo "PHASE:awaiting_review" → read phase file
wait for review-poll
review: REQUEST_CHANGES
←── tmux send-keys "Review: ..."
address review comments
push fixes
echo "PHASE:awaiting_review" → read phase file
review: APPROVE
←── tmux send-keys "Approved"
merge PR
echo "PHASE:done" ──────────→ read phase file
verify merged
kill session
close issue
```
## Notes
- The phase file is write-once-per-phase (always overwritten with `>`).
The orchestrator reads it, acts, then waits for the next write.
- Claude should write the phase sentinel **as the last action** of each phase,
after any git push or other side effects are complete.
- If Claude writes `PHASE:failed`, it should include a reason on the next line:
```bash
printf 'PHASE:failed\nReason: %s\n' "$reason" > "$PHASE_FILE"
```
- Phase files are cleaned up by the orchestrator after `PHASE:done` or
`PHASE:failed`.