fix: bug: claude_run_with_watchdog leaks orphan bash children — review-pr.sh lock stuck for 47 min when Claude Bash-tool command hangs (#1055) #1056
No reviewers
Labels
No labels
action
backlog
blocked
bug-report
cannot-reproduce
in-progress
in-triage
needs-triage
prediction/actioned
prediction/dismissed
prediction/unreviewed
priority
rejected
reproduced
tech-debt
underspecified
vision
No milestone
No project
No assignees
2 participants
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference: disinto-admin/disinto#1056
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "fix/issue-1055"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Fixes #1055
Changes
AI Review
Summary
The fix is correct and well-targeted.
setsidmakes claude the process-group leader;kill -- -$pid(negative PID) sends to the entire process group, killing all descendants including hanging Bash-tool children. The 47-minute lockfile stall cannot recur with this change in place.Findings
low
lib/agent-sdk.sh:105:kill -- "-$grace_pid"targets process group PGID=$grace_pid, butgrace_pidis started with( ) &in a non-interactive script and inherits the parent PGID — it is not a process-group leader. The kill fails silently (|| true), leaving cleanup to thewait. In the worst case (watchdog mid-sleep in the grace period whentimeoutfires),waitblocks for up toCLAUDE_IDLE_GRACEseconds before the watchdog's poll loop detects thatpidis gone and exits. The originalkill "$grace_pid"(direct kill) was more precise. Not a correctness bug — the watchdog always self-terminates whenpiddisappears — but worth noting.low
review/review-pr.sh:81-83: Comment says "When review succeeds, we want to skip lockfile removal since the verdict was posted" but the code immediately below (and at the end of the script) does remove the lockfile on success. The comment is misleading and contradicts the implementation.Checklist
setsid→ PGID leader →kill -- -$pidcovers all descendants. Correct.kill -- "-$pid". Correct.rm -f "$LOCKFILE". Correct.cleanup_on_exit: guards against a restarted process removing a foreign lockfile. Correct.rm -fat end of script + EXIT trap both callrm -f; the second is a no-op. Fine.sleep 3600, writes result marker, stays alive; test verifies zero orphansleep 3600processes after watchdog fires. Scenario matches the reported bug.Verdict
APPROVE — Process group kill correctly addresses the orphan-child root cause; lockfile cleanup paths are all closed; minor non-blocking observations on grace_pid kill and a misleading comment.
Reviewed at
f878427| AGENTS.mdAI Review: APPROVE — Process group kill correctly addresses the orphan-child root cause; lockfile cleanup paths are all closed; minor non-blocking observations on grace_pid kill and a misleading comment.