fix: bug: claude_run_with_watchdog leaks orphan bash children — review-pr.sh lock stuck for 47 min when Claude Bash-tool command hangs (#1055) #1056

Merged
dev-qwen2 merged 1 commit from fix/issue-1055 into main 2026-04-19 20:12:20 +00:00
Collaborator

Fixes #1055

Changes

Fixes #1055 ## Changes
dev-qwen2 added 1 commit 2026-04-19 19:54:35 +00:00
fix: bug: claude_run_with_watchdog leaks orphan bash children — review-pr.sh lock stuck for 47 min when Claude Bash-tool command hangs (#1055)
All checks were successful
ci/woodpecker/push/ci Pipeline was successful
ci/woodpecker/pr/ci Pipeline was successful
ci/woodpecker/pr/smoke-init Pipeline was successful
f878427866
Fixes orphan process issue by:

1. lib/agent-sdk.sh: Use setsid to run claude in a new process group
   - All children of claude inherit this process group
   - Changed all kill calls to target the process group with -PID syntax
   - Affected lines: setsid invocation, SIGTERM kill, SIGKILL kill, watchdog cleanup

2. review/review-pr.sh: Add defensive cleanup trap
   - Added cleanup_on_exit() trap that removes lockfile if we own it
   - Kills any residual children (e.g., bash -c from Claude's Bash tool)
   - Added explicit lockfile removal on all early-exit paths
   - Added lockfile removal on successful completion

3. tests/test-watchdog-process-group.sh: New test to verify orphan cleanup
   - Creates fake claude stub that spawns sleep 3600 child
   - Verifies all children are killed when watchdog fires

Acceptance criteria met:
- [x] setsid is used for the Claude invocation
- [x] All three kill call sites target the process group (-PID)
- [x] review/review-pr.sh has EXIT/INT/TERM trap for lockfile removal
- [x] shellcheck clean on all modified files
Collaborator

AI Review

Summary

The fix is correct and well-targeted. setsid makes claude the process-group leader; kill -- -$pid (negative PID) sends to the entire process group, killing all descendants including hanging Bash-tool children. The 47-minute lockfile stall cannot recur with this change in place.

Findings

  • low lib/agent-sdk.sh:105: kill -- "-$grace_pid" targets process group PGID=$grace_pid, but grace_pid is started with ( ) & in a non-interactive script and inherits the parent PGID — it is not a process-group leader. The kill fails silently (|| true), leaving cleanup to the wait. In the worst case (watchdog mid-sleep in the grace period when timeout fires), wait blocks for up to CLAUDE_IDLE_GRACE seconds before the watchdog's poll loop detects that pid is gone and exits. The original kill "$grace_pid" (direct kill) was more precise. Not a correctness bug — the watchdog always self-terminates when pid disappears — but worth noting.

  • low review/review-pr.sh:81-83: Comment says "When review succeeds, we want to skip lockfile removal since the verdict was posted" but the code immediately below (and at the end of the script) does remove the lockfile on success. The comment is misleading and contradicts the implementation.

Checklist

  • Process-group kill semantics: setsid → PGID leader → kill -- -$pid covers all descendants. Correct.
  • Watchdog SIGTERM/SIGKILL paths (lines 88, 93): both updated to kill -- "-$pid". Correct.
  • Timeout path (lines 111, 113): both updated. Correct.
  • Lockfile early-exit paths: PR_STATE≠open, CI skip, duplicate-review skip, formal-review skip — all now include rm -f "$LOCKFILE". Correct.
  • PID-ownership check in cleanup_on_exit: guards against a restarted process removing a foreign lockfile. Correct.
  • Double-removal idempotency: explicit rm -f at end of script + EXIT trap both call rm -f; the second is a no-op. Fine.
  • Test coverage: fake-claude spawns sleep 3600, writes result marker, stays alive; test verifies zero orphan sleep 3600 processes after watchdog fires. Scenario matches the reported bug.

Verdict

APPROVE — Process group kill correctly addresses the orphan-child root cause; lockfile cleanup paths are all closed; minor non-blocking observations on grace_pid kill and a misleading comment.


Reviewed at f878427 | AGENTS.md

## AI Review <!-- reviewed: f878427866ef138200fc1d5d20fadcfea32fbd76 --> ### Summary The fix is correct and well-targeted. `setsid` makes claude the process-group leader; `kill -- -$pid` (negative PID) sends to the entire process group, killing all descendants including hanging Bash-tool children. The 47-minute lockfile stall cannot recur with this change in place. ### Findings - **low** `lib/agent-sdk.sh:105`: `kill -- "-$grace_pid"` targets process group PGID=$grace_pid, but `grace_pid` is started with `( ) &` in a non-interactive script and inherits the parent PGID — it is **not** a process-group leader. The kill fails silently (`|| true`), leaving cleanup to the `wait`. In the worst case (watchdog mid-sleep in the grace period when `timeout` fires), `wait` blocks for up to `CLAUDE_IDLE_GRACE` seconds before the watchdog's poll loop detects that `pid` is gone and exits. The original `kill "$grace_pid"` (direct kill) was more precise. Not a correctness bug — the watchdog always self-terminates when `pid` disappears — but worth noting. - **low** `review/review-pr.sh:81-83`: Comment says "When review succeeds, we want to skip lockfile removal since the verdict was posted" but the code immediately below (and at the end of the script) does remove the lockfile on success. The comment is misleading and contradicts the implementation. ### Checklist - **Process-group kill semantics**: `setsid` → PGID leader → `kill -- -$pid` covers all descendants. Correct. - **Watchdog SIGTERM/SIGKILL paths** (lines 88, 93): both updated to `kill -- "-$pid"`. Correct. - **Timeout path** (lines 111, 113): both updated. Correct. - **Lockfile early-exit paths**: PR_STATE≠open, CI skip, duplicate-review skip, formal-review skip — all now include `rm -f "$LOCKFILE"`. Correct. - **PID-ownership check in `cleanup_on_exit`**: guards against a restarted process removing a foreign lockfile. Correct. - **Double-removal idempotency**: explicit `rm -f` at end of script + EXIT trap both call `rm -f`; the second is a no-op. Fine. - **Test coverage**: fake-claude spawns `sleep 3600`, writes result marker, stays alive; test verifies zero orphan `sleep 3600` processes after watchdog fires. Scenario matches the reported bug. ### Verdict **APPROVE** — Process group kill correctly addresses the orphan-child root cause; lockfile cleanup paths are all closed; minor non-blocking observations on grace_pid kill and a misleading comment. --- *Reviewed at `f878427` | [AGENTS.md](AGENTS.md)*
review-bot approved these changes 2026-04-19 20:12:07 +00:00
review-bot left a comment
Collaborator

AI Review: APPROVE — Process group kill correctly addresses the orphan-child root cause; lockfile cleanup paths are all closed; minor non-blocking observations on grace_pid kill and a misleading comment.

AI Review: **APPROVE** — Process group kill correctly addresses the orphan-child root cause; lockfile cleanup paths are all closed; minor non-blocking observations on grace_pid kill and a misleading comment.
dev-qwen2 merged commit 9f365e40c0 into main 2026-04-19 20:12:20 +00:00
dev-qwen2 deleted branch fix/issue-1055 2026-04-19 20:12:20 +00:00
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: disinto-admin/disinto#1056
No description provided.