fix: feat: StopFailure hook writes phase file on API error / rate limit (#275)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
parent 109758e86b
commit eaf2841494
4 changed files with 127 additions and 3 deletions
@@ -267,7 +267,7 @@ sourced as needed.
 | `lib/load-project.sh` | Parses a `projects/*.toml` file into env vars (`PROJECT_NAME`, `CODEBERG_REPO`, `WOODPECKER_REPO_ID`, monitoring toggles, Matrix config, etc.). | env.sh (when `PROJECT_TOML` is set), supervisor-poll (per-project iteration) |
 | `lib/parse-deps.sh` | Extracts dependency issue numbers from an issue body (stdin → stdout, one number per line). Matches `## Dependencies` / `## Depends on` / `## Blocked by` sections and inline `depends on #N` patterns. Not sourced — executed via `bash lib/parse-deps.sh`. | dev-poll, supervisor-poll |
 | `lib/matrix_listener.sh` | Long-poll Matrix sync daemon. Dispatches thread replies to the correct agent via well-known files (`/tmp/{agent}-escalation-reply`). Handles supervisor, gardener, dev, review, vault, and action reply routing. Run as systemd service. | Standalone daemon |
-| `lib/agent-session.sh` | Shared tmux + Claude session helpers: `create_agent_session()`, `inject_formula()`, `agent_wait_for_claude_ready()`, `agent_inject_into_session()`, `agent_kill_session()`, `monitor_phase_loop()`, `read_phase()`. `create_agent_session(session, workdir, [phase_file])` optionally installs a PostToolUse hook (matcher `Bash\|Write`) that detects phase file writes in real-time — when Claude writes to the phase file, the hook writes a marker so `monitor_phase_loop` reacts on the next poll instead of waiting for mtime changes. When `MATRIX_THREAD_ID` is exported, also installs a Stop hook (`on-stop-matrix.sh`) that streams each Claude response to the Matrix thread. `monitor_phase_loop` sets `_MONITOR_LOOP_EXIT` to one of: `done`, `idle_timeout`, `idle_prompt` (Claude returned to `❯` for 3 consecutive polls without writing any phase — callback invoked with `PHASE:failed`, session already dead), `crashed`, or a `PHASE:*` string. Agents must handle `idle_prompt` in both their callback and their post-loop exit handler. | dev-agent.sh, gardener-agent.sh, action-agent.sh |
+| `lib/agent-session.sh` | Shared tmux + Claude session helpers: `create_agent_session()`, `inject_formula()`, `agent_wait_for_claude_ready()`, `agent_inject_into_session()`, `agent_kill_session()`, `monitor_phase_loop()`, `read_phase()`. `create_agent_session(session, workdir, [phase_file])` optionally installs a PostToolUse hook (matcher `Bash\|Write`) that detects phase file writes in real-time — when Claude writes to the phase file, the hook writes a marker so `monitor_phase_loop` reacts on the next poll instead of waiting for mtime changes. Also installs a StopFailure hook (matcher `rate_limit\|server_error\|authentication_failed\|billing_error`) that writes `PHASE:failed` with an `api_error` reason to the phase file and touches the phase-changed marker, so the orchestrator discovers API errors within one poll cycle instead of waiting for idle timeout. When `MATRIX_THREAD_ID` is exported, also installs a Stop hook (`on-stop-matrix.sh`) that streams each Claude response to the Matrix thread. `monitor_phase_loop` sets `_MONITOR_LOOP_EXIT` to one of: `done`, `idle_timeout`, `idle_prompt` (Claude returned to `❯` for 3 consecutive polls without writing any phase — callback invoked with `PHASE:failed`, session already dead), `crashed`, or a `PHASE:*` string. Agents must handle `idle_prompt` in both their callback and their post-loop exit handler. | dev-agent.sh, gardener-agent.sh, action-agent.sh |
 
 ---
 
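Review note: the `lib/agent-session.sh` row above lists the values `monitor_phase_loop` may leave in `_MONITOR_LOOP_EXIT`. A post-loop exit handler covering those values could look roughly like the sketch below. The function name `describe_monitor_exit` and the messages are hypothetical and do not come from this commit; only the exit values are taken from the table.

```shell
#!/usr/bin/env bash
# Hypothetical post-loop handler: maps a _MONITOR_LOOP_EXIT value to a
# short description. Only the exit values themselves come from the docs.
describe_monitor_exit() {
  local exit_state="$1"
  case "$exit_state" in
    "done")        echo "completed" ;;
    idle_timeout)  echo "idle timeout" ;;
    idle_prompt)   echo "returned to prompt without writing a phase" ;;
    crashed)       echo "session crashed" ;;
    PHASE:failed)  echo "failed, see phase file for the reason" ;;
    PHASE:*)       echo "stopped in phase ${exit_state#PHASE:}" ;;
    *)             echo "unknown exit state: ${exit_state}" ;;
  esac
}
```

An agent would call something like `describe_monitor_exit "$_MONITOR_LOOP_EXIT"` after the loop returns. A StopFailure-triggered failure would presumably surface as `PHASE:failed`, with the `api_error` reason on the second line of the phase file.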
@@ -226,7 +226,64 @@ else
   fail "PostToolUse hook script not found or not executable: $HOOK_SCRIPT"
 fi
 
-# ── Test 10: phase-changed marker resets mtime guard ─────────────────────
+# ── Test 10: StopFailure hook writes phase file and marker on API error ───
+STOP_FAILURE_HOOK="$(dirname "$0")/../lib/hooks/on-stop-failure.sh"
+SF_MARKER="/tmp/phase-changed-test-sf.marker"
+rm -f "$SF_MARKER" "$PHASE_FILE"
+
+if [ -x "$STOP_FAILURE_HOOK" ]; then
+  # 10a: rate_limit stop reason → PHASE:failed with api_error reason
+  printf '{"stop_reason":"rate_limit"}' \
+    | "$STOP_FAILURE_HOOK" "$PHASE_FILE" "$SF_MARKER"
+  sf_first=$(head -1 "$PHASE_FILE" 2>/dev/null)
+  sf_second=$(sed -n '2p' "$PHASE_FILE" 2>/dev/null)
+  if [ "$sf_first" = "PHASE:failed" ] && echo "$sf_second" | grep -q "api_error: rate_limit"; then
+    ok "StopFailure hook writes PHASE:failed with api_error: rate_limit"
+  else
+    fail "StopFailure hook phase file: first='$sf_first' second='$sf_second'"
+  fi
+  if [ -f "$SF_MARKER" ]; then
+    ok "StopFailure hook writes phase-changed marker"
+  else
+    fail "StopFailure hook did not write phase-changed marker"
+  fi
+  rm -f "$SF_MARKER" "$PHASE_FILE"
+
+  # 10b: server_error stop reason
+  printf '{"stop_reason":"server_error"}' \
+    | "$STOP_FAILURE_HOOK" "$PHASE_FILE" "$SF_MARKER"
+  sf_second=$(sed -n '2p' "$PHASE_FILE" 2>/dev/null)
+  if echo "$sf_second" | grep -q "api_error: server_error"; then
+    ok "StopFailure hook writes api_error: server_error"
+  else
+    fail "StopFailure hook server_error: got '$sf_second'"
+  fi
+  rm -f "$SF_MARKER" "$PHASE_FILE"
+
+  # 10c: missing phase_file arg → no-op (exit 0, no crash)
+  printf '{"stop_reason":"rate_limit"}' | "$STOP_FAILURE_HOOK" "" "$SF_MARKER"
+  if [ ! -f "$PHASE_FILE" ]; then
+    ok "StopFailure hook no-ops when phase_file is empty"
+  else
+    fail "StopFailure hook should not write when phase_file is empty"
+  fi
+  rm -f "$SF_MARKER"
+
+  # 10d: missing marker arg → phase file still written, no marker
+  printf '{"stop_reason":"billing_error"}' \
+    | "$STOP_FAILURE_HOOK" "$PHASE_FILE" ""
+  sf_first=$(head -1 "$PHASE_FILE" 2>/dev/null)
+  if [ "$sf_first" = "PHASE:failed" ] && [ ! -f "$SF_MARKER" ]; then
+    ok "StopFailure hook writes phase without marker when marker arg is empty"
+  else
+    fail "StopFailure hook: first='$sf_first' marker_exists=$([ -f "$SF_MARKER" ] && echo yes || echo no)"
+  fi
+  rm -f "$PHASE_FILE"
+else
+  fail "StopFailure hook script not found or not executable: $STOP_FAILURE_HOOK"
+fi
+
+# ── Test 11: phase-changed marker resets mtime guard ─────────────────────
 # Simulates monitor_phase_loop behavior: when marker exists, last_mtime
 # is reset to 0 so the phase is processed even if mtime hasn't changed.
 echo "PHASE:awaiting_ci" > "$PHASE_FILE"
@@ -241,7 +298,7 @@ else
 fi
 
 # Now simulate marker present — reset last_mtime to 0
-MARKER_FILE="/tmp/phase-changed-test-session.marker"
+MARKER_FILE="/tmp/phase-changed-test-mtime.marker"
 date +%s > "$MARKER_FILE"
 if [ -f "$MARKER_FILE" ]; then
   rm -f "$MARKER_FILE"
@@ -47,6 +47,7 @@ agent_inject_into_session() {
 # Installs a Stop hook for idle detection (see monitor_phase_loop).
 # Installs a PreToolUse hook to guard destructive Bash operations.
 # Optionally installs a PostToolUse hook for phase file write detection.
+# Optionally installs a StopFailure hook for immediate phase file update on API error.
 # Args: session workdir [phase_file]
 # Returns 0 if session is ready, 1 otherwise.
 create_agent_session() {
@@ -121,6 +122,38 @@ create_agent_session() {
     fi
   fi
 
+  # Install StopFailure hook for immediate phase file update on API error:
+  # when Claude hits a rate limit, server error, billing error, or auth failure,
+  # the hook writes PHASE:failed to the phase file and touches the phase-changed
+  # marker so monitor_phase_loop picks it up within one poll cycle instead of
+  # waiting for idle timeout (up to 2 hours).
+  if [ -n "$phase_file" ]; then
+    local stop_failure_hook_script="${FACTORY_ROOT}/lib/hooks/on-stop-failure.sh"
+    if [ -x "$stop_failure_hook_script" ]; then
+      local stop_failure_hook_cmd="${stop_failure_hook_script} ${phase_file} ${phase_marker}"
+      if [ -f "$settings" ]; then
+        jq --arg cmd "$stop_failure_hook_cmd" '
+          if (.hooks.StopFailure // [] | any(.[]; .hooks[]?.command == $cmd))
+          then .
+          else .hooks.StopFailure = (.hooks.StopFailure // []) + [{
+            matcher: "rate_limit|server_error|authentication_failed|billing_error",
+            hooks: [{type: "command", command: $cmd}]
+          }]
+          end
+        ' "$settings" > "${settings}.tmp" && mv "${settings}.tmp" "$settings"
+      else
+        jq -n --arg cmd "$stop_failure_hook_cmd" '{
+          hooks: {
+            StopFailure: [{
+              matcher: "rate_limit|server_error|authentication_failed|billing_error",
+              hooks: [{type: "command", command: $cmd}]
+            }]
+          }
+        }' > "$settings"
+      fi
+    fi
+  fi
+
   # Install PreToolUse hook for destructive operation guard: blocks force push
   # to primary branch, rm -rf outside worktree, direct API merge calls, and
   # checkout/switch to primary branch. Claude sees the denial reason on exit 2
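Review note: the `jq` merge in the hunk above is meant to be idempotent, so calling `create_agent_session` repeatedly should leave a single StopFailure entry per command. The sketch below isolates that behaviour under a few stated assumptions. `merge_stop_failure_hook` is a hypothetical helper name. It also seeds a missing settings file with `{}` rather than using a separate `jq -n` branch, which is a simplification and not what the commit does. `jq` must be on `PATH`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: add a StopFailure hook entry to a settings file
# unless an entry with the same command already exists.
# Simplification (not from the commit): a missing file is seeded with {}.
merge_stop_failure_hook() {
  local settings="$1" cmd="$2"
  [ -f "$settings" ] || echo '{}' > "$settings"
  jq --arg cmd "$cmd" '
    if (.hooks.StopFailure // [] | any(.[]; .hooks[]?.command == $cmd))
    then .
    else .hooks.StopFailure = (.hooks.StopFailure // []) + [{
      matcher: "rate_limit|server_error|authentication_failed|billing_error",
      hooks: [{type: "command", command: $cmd}]
    }]
    end
  ' "$settings" > "${settings}.tmp" && mv "${settings}.tmp" "$settings"
}
```

Running the helper twice with the same command string should leave `.hooks.StopFailure | length` at 1. A different command string (for example, another phase file path) would append a second entry.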
lib/hooks/on-stop-failure.sh (new executable file, 34 lines added)
@@ -0,0 +1,34 @@
+#!/bin/bash
+# on-stop-failure.sh — StopFailure hook for immediate phase file update on API error.
+#
+# Called by Claude Code when a turn ends due to an API error (rate limit,
+# server error, billing error, authentication failure). Writes PHASE:failed
+# to the phase file and touches the phase-changed marker so the orchestrator
+# picks up the failure within one poll cycle instead of waiting for idle
+# timeout (up to 2 hours).
+#
+# Usage (in .claude/settings.json):
+#   {"type": "command", "command": "this-script /path/to/phase-file /path/to/marker"}
+#
+# Args: $1 = phase file path, $2 = phase-changed marker path
+
+phase_file="${1:-}"
+marker_file="${2:-}"
+
+[ -z "$phase_file" ] && exit 0
+
+input=$(cat)  # consume hook JSON from stdin
+
+# Extract the stop reason from the hook payload
+reason=$(printf '%s' "$input" | jq -r '
+  .stop_reason // .matched_hook // .reason // .type // "unknown"
+' 2>/dev/null)
+[ -z "$reason" ] && reason="unknown"
+
+# Write phase file immediately — orchestrator reads first line as phase sentinel
+printf 'PHASE:failed\nReason: api_error: %s\n' "$reason" > "$phase_file"
+
+# Touch marker so monitor_phase_loop picks this up on the next poll cycle
+if [ -n "$marker_file" ]; then
+  date +%s > "$marker_file"
+fi
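Review note: the hook above writes a two-line phase file (`PHASE:failed`, then `Reason: api_error: <reason>`) and a marker file. The consuming side could read them roughly as sketched below. These helpers are hypothetical and are not the actual `monitor_phase_loop` code. The file layout is taken from the `printf` in the hook, and the marker handling follows the Test 11 comment about resetting `last_mtime` to 0.

```shell
#!/usr/bin/env bash
# Hypothetical consumer-side helpers; names are illustrative only.

# Print the phase sentinel (first line of the phase file), if the file exists.
phase_sentinel() {
  head -n 1 "$1" 2>/dev/null
}

# Print the API error reason from line 2 ("Reason: api_error: <reason>").
# Prints nothing if line 2 does not have that shape.
phase_api_error_reason() {
  sed -n '2s/^Reason: api_error: //p' "$1" 2>/dev/null
}

# Return 0 and delete the marker if it exists, 1 otherwise. A poll loop
# could reset its last_mtime to 0 whenever this returns 0.
consume_phase_marker() {
  [ -f "$1" ] || return 1
  rm -f "$1"
}
```

A poll loop built on these might call `consume_phase_marker "$marker"` first and, on success, re-read the phase file regardless of its mtime.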