bug: claude_run_with_watchdog leaks orphan bash children — review-pr.sh lock stuck for 47 min when Claude Bash-tool command hangs #1055

New issue

Closed

opened 2026-04-19 19:45:34 +00:00 by disinto-admin · 0 comments

disinto-admin commented

2026-04-19 19:45:34 +00:00

Owner

Problem

When Claude's internal Bash tool invokes a command that hangs (e.g. malformed jq with misplaced redirection that leaves stdin open, or any child process that blocks on a never-closing FD), the claude_run_with_watchdog helper in lib/agent-sdk.sh kills only the Claude parent process — not the orphan bash children it spawned. These orphans inherit PID 1, hold the review lockfile indefinitely, and wedge the entire review pipeline until a human finds and kills them.

Observed incident (2026-04-19)

Review of PR #1052 hung for 47 minutes (18:58 UTC → 19:42 UTC when I manually killed the tree). During that window, no subsequent review-poll cycles ran — every 5-minute entrypoint iteration started a new review-poll.sh, each of which exited immediately when it saw the stale /tmp/disinto-review.lock. PR #1053 (opened 19:09) had green CI the whole time but received no review.

Stuck process tree when I found it:

600417  bash review/review-poll.sh         <- holds /tmp/disinto-review.lock
600474  bash review/review-poll.sh
600475  bash review/review-pr.sh 1052
600669  bash review/review-pr.sh 1052
600672  bash review/review-pr.sh 1052
601243  /bin/bash -c "source snapshot.sh && eval 'jq ... > file.json echo \"written\"'"
601256  /bin/bash -c "source snapshot.sh && eval 'jq ... > file.json echo \"written\"'"

Claude itself had exited (Claude's main PID was missing from the tree). Its bash -c children (601243, 601256 — double-fork from Claude's Bash tool) remained alive, blocked inside jq. The malformed command was Claude-generated:

jq -n --arg verdict "APPROVE" ... '{...}' > /tmp/disinto-review-output-1052.json echo "written"
#                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# Dropped `&&` before `echo "written"`. jq treats `echo` and `written` as input
# file arguments; opens them with O_RDONLY on pipes that never close; blocks.

Root cause

lib/agent-sdk.sh:claude_run_with_watchdog() (agent-sdk.sh:47-117) spawns Claude as a background child and uses timeout --foreground ... tail --pid="$pid" as the hard ceiling. When the watchdog fires, it does:

if [ "$rc" -eq 124 ]; then
  kill "$pid" 2>/dev/null || true
  sleep 1
  kill -KILL "$pid" 2>/dev/null || true
fi

This kills only the immediate Claude process. Children spawned by Claude's Bash tool are reparented to init (PID 1 = bash /entrypoint.sh, which doesn't reap them meaningfully) and continue running.

The idle-after-result detection path (SIGTERM after grace period, line 82-89) has the same single-pid-kill bug.

Fix

Run Claude in a new process group and signal the whole group on kill:

# In claude_run_with_watchdog, replace:
"${cmd[@]}" > "$out_file" 2>>"$LOGFILE" &
pid=$!

# With:
setsid "${cmd[@]}" > "$out_file" 2>>"$LOGFILE" &
pid=$!
# setsid makes $pid a process group leader; PGID == PID.

# Then every kill call that targets $pid:
kill "$pid"      →  kill -TERM -- "-$pid"   # negative target = kill group
kill -KILL "$pid" → kill -KILL -- "-$pid"

All three kill call sites (agent-sdk.sh:86, agent-sdk.sh:90, agent-sdk.sh:111-113) need the sign flip. The tail --pid= waiter can stay as-is — it's just observing the leader.

Additionally, in review/review-pr.sh, add a defensive trap so that if review-pr.sh itself is terminated for any reason, it removes its own lockfile and SIGKILLs any residual children:

cleanup_on_exit() {
  local ec=$?
  [ -f "$LOCKFILE" ] && grep -q "^$$$" "$LOCKFILE" 2>/dev/null && rm -f "$LOCKFILE"
  pkill -P $$ 2>/dev/null || true
  exit "$ec"
}
trap cleanup_on_exit EXIT INT TERM

Acceptance criteria

Create a deliberate reproducer: a test that runs claude_run_with_watchdog with a fake claude stub that spawns a sleep 3600 & child and exits itself. Before the fix: sleep child survives the watchdog. After the fix: sleep child dies when the watchdog fires.
setsid is used for the Claude invocation in claude_run_with_watchdog.
All three kill call sites in claude_run_with_watchdog target the process group (-PID), not just the leader.
review/review-pr.sh has an EXIT/INT/TERM trap that removes its lockfile and kills residual children.
shellcheck clean on both files.
Regression scenario: deliberately inject the malformed jq command from the incident (write a test prompt that tells a fake Claude to emit jq ... > file echo written). Verify that within CLAUDE_TIMEOUT + 10s, the whole tree is dead and /tmp/disinto-review.lock is gone.
Same test run through dev/dev-agent.sh path — pr_walk_to_merge should also recover (it uses the same agent_run helper).

Affected files

lib/agent-sdk.sh — claude_run_with_watchdog() (lines 47-117); three kill sites
review/review-pr.sh — cleanup trap around line 85 (lock acquisition)
tests/ — new reproducer test (name e.g. tests/test-watchdog-process-group.sh)

#591 — upstream Claude Code hang that motivated the watchdog originally. This issue is a gap in that fix's blast radius.
#1044 — woodpecker-agent log truncation; a stuck Claude that produces no final output would also benefit from reliable process-group cleanup.

Out of scope

Fixing Claude's prompt quality to avoid generating malformed commands in the first place. That's a separate concern; the watchdog is the last line of defense.
Moving review-pr.sh to a supervisor model (systemd/s6). Just need process-group plumbing in the existing shell script.

## Problem When Claude's internal Bash tool invokes a command that hangs (e.g. malformed `jq` with misplaced redirection that leaves stdin open, or any child process that blocks on a never-closing FD), the `claude_run_with_watchdog` helper in `lib/agent-sdk.sh` kills only the Claude parent process — not the orphan bash children it spawned. These orphans inherit PID 1, hold the review lockfile indefinitely, and wedge the entire review pipeline until a human finds and kills them. ## Observed incident (2026-04-19) Review of PR #1052 hung for **47 minutes** (18:58 UTC → 19:42 UTC when I manually killed the tree). During that window, no subsequent review-poll cycles ran — every 5-minute entrypoint iteration started a new `review-poll.sh`, each of which exited immediately when it saw the stale `/tmp/disinto-review.lock`. PR #1053 (opened 19:09) had green CI the whole time but received no review. Stuck process tree when I found it: ``` 600417 bash review/review-poll.sh <- holds /tmp/disinto-review.lock 600474 bash review/review-poll.sh 600475 bash review/review-pr.sh 1052 600669 bash review/review-pr.sh 1052 600672 bash review/review-pr.sh 1052 601243 /bin/bash -c "source snapshot.sh && eval 'jq ... > file.json echo \"written\"'" 601256 /bin/bash -c "source snapshot.sh && eval 'jq ... > file.json echo \"written\"'" ``` Claude itself had exited (Claude's main PID was missing from the tree). Its `bash -c` children (601243, 601256 — double-fork from Claude's Bash tool) remained alive, blocked inside `jq`. The malformed command was Claude-generated: ```bash jq -n --arg verdict "APPROVE" ... '{...}' > /tmp/disinto-review-output-1052.json echo "written" # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ # Dropped `&&` before `echo "written"`. jq treats `echo` and `written` as input # file arguments; opens them with O_RDONLY on pipes that never close; blocks. ``` ## Root cause `lib/agent-sdk.sh:claude_run_with_watchdog()` ([agent-sdk.sh:47-117](lib/agent-sdk.sh)) spawns Claude as a background child and uses `timeout --foreground ... tail --pid="$pid"` as the hard ceiling. When the watchdog fires, it does: ```bash if [ "$rc" -eq 124 ]; then kill "$pid" 2>/dev/null || true sleep 1 kill -KILL "$pid" 2>/dev/null || true fi ``` This kills only the immediate Claude process. Children spawned by Claude's Bash tool are reparented to init (PID 1 = `bash /entrypoint.sh`, which doesn't reap them meaningfully) and continue running. The idle-after-result detection path (SIGTERM after grace period, line 82-89) has the same single-pid-kill bug. ## Fix Run Claude in a **new process group** and signal the whole group on kill: ```bash # In claude_run_with_watchdog, replace: "${cmd[@]}" > "$out_file" 2>>"$LOGFILE" & pid=$! # With: setsid "${cmd[@]}" > "$out_file" 2>>"$LOGFILE" & pid=$! # setsid makes $pid a process group leader; PGID == PID. # Then every kill call that targets $pid: kill "$pid" → kill -TERM -- "-$pid" # negative target = kill group kill -KILL "$pid" → kill -KILL -- "-$pid" ``` All three kill call sites ([agent-sdk.sh:86](lib/agent-sdk.sh), [agent-sdk.sh:90](lib/agent-sdk.sh), [agent-sdk.sh:111-113](lib/agent-sdk.sh)) need the sign flip. The `tail --pid=` waiter can stay as-is — it's just observing the leader. Additionally, in `review/review-pr.sh`, add a defensive trap so that if review-pr.sh itself is terminated for any reason, it removes its own lockfile and SIGKILLs any residual children: ```bash cleanup_on_exit() { local ec=$? [ -f "$LOCKFILE" ] && grep -q "^$$$" "$LOCKFILE" 2>/dev/null && rm -f "$LOCKFILE" pkill -P $$ 2>/dev/null || true exit "$ec" } trap cleanup_on_exit EXIT INT TERM ``` ## Acceptance criteria - [ ] Create a deliberate reproducer: a test that runs `claude_run_with_watchdog` with a fake `claude` stub that spawns a `sleep 3600 &` child and exits itself. Before the fix: sleep child survives the watchdog. After the fix: sleep child dies when the watchdog fires. - [ ] `setsid` is used for the Claude invocation in `claude_run_with_watchdog`. - [ ] All three kill call sites in `claude_run_with_watchdog` target the process group (`-PID`), not just the leader. - [ ] `review/review-pr.sh` has an `EXIT`/`INT`/`TERM` trap that removes its lockfile and kills residual children. - [ ] `shellcheck` clean on both files. - [ ] Regression scenario: deliberately inject the malformed jq command from the incident (write a test prompt that tells a fake Claude to emit `jq ... > file echo written`). Verify that within `CLAUDE_TIMEOUT + 10s`, the whole tree is dead and `/tmp/disinto-review.lock` is gone. - [ ] Same test run through `dev/dev-agent.sh` path — `pr_walk_to_merge` should also recover (it uses the same `agent_run` helper). ## Affected files - `lib/agent-sdk.sh` — `claude_run_with_watchdog()` (lines 47-117); three kill sites - `review/review-pr.sh` — cleanup trap around line 85 (lock acquisition) - `tests/` — new reproducer test (name e.g. `tests/test-watchdog-process-group.sh`) ## Related - #591 — upstream Claude Code hang that motivated the watchdog originally. This issue is a gap in that fix's blast radius. - #1044 — woodpecker-agent log truncation; a stuck Claude that produces no final output would also benefit from reliable process-group cleanup. ## Out of scope - Fixing Claude's prompt quality to avoid generating malformed commands in the first place. That's a separate concern; the watchdog is the last line of defense. - Moving review-pr.sh to a supervisor model (systemd/s6). Just need process-group plumbing in the existing shell script.