bug: claude_run_with_watchdog leaks orphan bash children — review-pr.sh lock stuck for 47 min when Claude Bash-tool command hangs #1055
Labels
No labels
action
backlog
blocked
bug-report
cannot-reproduce
in-progress
in-triage
needs-triage
prediction/actioned
prediction/dismissed
prediction/unreviewed
priority
rejected
reproduced
tech-debt
underspecified
vision
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference: disinto-admin/disinto#1055
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Problem
When Claude's internal Bash tool invokes a command that hangs (e.g. malformed
jqwith misplaced redirection that leaves stdin open, or any child process that blocks on a never-closing FD), theclaude_run_with_watchdoghelper inlib/agent-sdk.shkills only the Claude parent process — not the orphan bash children it spawned. These orphans inherit PID 1, hold the review lockfile indefinitely, and wedge the entire review pipeline until a human finds and kills them.Observed incident (2026-04-19)
Review of PR #1052 hung for 47 minutes (18:58 UTC → 19:42 UTC when I manually killed the tree). During that window, no subsequent review-poll cycles ran — every 5-minute entrypoint iteration started a new
review-poll.sh, each of which exited immediately when it saw the stale/tmp/disinto-review.lock. PR #1053 (opened 19:09) had green CI the whole time but received no review.Stuck process tree when I found it:
Claude itself had exited (Claude's main PID was missing from the tree). Its
bash -cchildren (601243, 601256 — double-fork from Claude's Bash tool) remained alive, blocked insidejq. The malformed command was Claude-generated:Root cause
lib/agent-sdk.sh:claude_run_with_watchdog()(agent-sdk.sh:47-117) spawns Claude as a background child and usestimeout --foreground ... tail --pid="$pid"as the hard ceiling. When the watchdog fires, it does:This kills only the immediate Claude process. Children spawned by Claude's Bash tool are reparented to init (PID 1 =
bash /entrypoint.sh, which doesn't reap them meaningfully) and continue running.The idle-after-result detection path (SIGTERM after grace period, line 82-89) has the same single-pid-kill bug.
Fix
Run Claude in a new process group and signal the whole group on kill:
All three kill call sites (agent-sdk.sh:86, agent-sdk.sh:90, agent-sdk.sh:111-113) need the sign flip. The
tail --pid=waiter can stay as-is — it's just observing the leader.Additionally, in
review/review-pr.sh, add a defensive trap so that if review-pr.sh itself is terminated for any reason, it removes its own lockfile and SIGKILLs any residual children:Acceptance criteria
claude_run_with_watchdogwith a fakeclaudestub that spawns asleep 3600 &child and exits itself. Before the fix: sleep child survives the watchdog. After the fix: sleep child dies when the watchdog fires.setsidis used for the Claude invocation inclaude_run_with_watchdog.claude_run_with_watchdogtarget the process group (-PID), not just the leader.review/review-pr.shhas anEXIT/INT/TERMtrap that removes its lockfile and kills residual children.shellcheckclean on both files.jq ... > file echo written). Verify that withinCLAUDE_TIMEOUT + 10s, the whole tree is dead and/tmp/disinto-review.lockis gone.dev/dev-agent.shpath —pr_walk_to_mergeshould also recover (it uses the sameagent_runhelper).Affected files
lib/agent-sdk.sh—claude_run_with_watchdog()(lines 47-117); three kill sitesreview/review-pr.sh— cleanup trap around line 85 (lock acquisition)tests/— new reproducer test (name e.g.tests/test-watchdog-process-group.sh)Related
Out of scope