fix: agent-sdk.sh agent_run has no session lock — concurrent claude -p crashes #261

Closed
opened 2026-04-05 20:47:05 +00:00 by dev-bot · 0 comments
Collaborator

Problem

lib/agent-sdk.sh agent_run() calls claude -p directly (line 51) without any locking. When two agents run concurrently in the same container (e.g. dev-poll at :04 and review-poll at :07), two claude -p processes share the same ~/.claude/ config directory. This causes one session to crash with empty stdout (0 bytes JSON output).

Observed: dev agent working on #239 died at 20:08 with empty output. The review agent started reviewing PR #257 at 20:07. Their sessions overlapped for 1 minute. The dev session's JSONL ends with last-prompt (no result entry) — the CLI process died mid-execution.

This happened 3 times on the same issue, each time when a review session overlapped.

Root cause

The session lock (~/.claude/session.lock via flock) exists in lib/agent-session.sh (the old tmux-based path) but was never added to lib/agent-sdk.sh (the current claude -p path). All agents migrated to the SDK path, so no agent acquires the lock.

Fix

Wrap both claude -p invocations in agent_run() with flock:

# Before line 51 (initial run)
local lock_file="${HOME}/.claude/session.lock"
mkdir -p "$(dirname "$lock_file")"
output=$(cd "$run_dir" && flock -w 600 "$lock_file" timeout "${CLAUDE_TIMEOUT:-7200}" claude "${args[@]}" 2>>"$LOGFILE") || true

# Same for line 82 (nudge)
output=$(cd "$run_dir" && flock -w 600 "$lock_file" timeout "${CLAUDE_TIMEOUT:-7200}" claude -p "$nudge" ...) || true

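The flock wait timeout (-w 600, i.e. 600 seconds) should be longer than any reasonable review session so that the dev agent waits for the lock rather than failing. If the wait does expire, flock exits non-zero without running claude, and the trailing || true leaves output empty, so the wait budget needs to cover the longest session that can hold the lock.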
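A minimal sketch of how the wrapping could be factored into one helper, so that both call sites share the same lock logic and a lock-wait timeout can be told apart from a failed claude run. The helper name run_locked and the sentinel exit code 75 are assumptions for illustration and do not appear in lib/agent-sdk.sh; the -w and -E options are those of util-linux flock.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of lib/agent-sdk.sh): run a command while
# holding an exclusive flock on a lock file.

# Assumed sentinel exit code, chosen so a lock-wait timeout is distinguishable
# from the wrapped command's own exit status.
LOCK_WAIT_EXIT=75

run_locked() {
  local lock_file="$1" wait_secs="$2"
  shift 2
  # The lock file's directory must exist before flock can open the file.
  mkdir -p "$(dirname "$lock_file")"
  # -w: wait up to wait_secs for the lock.
  # -E: exit status flock returns if that wait times out.
  flock -w "$wait_secs" -E "$LOCK_WAIT_EXIT" "$lock_file" "$@"
}
```

Under these assumptions, a call site in agent_run() could take the form run_locked "$lock_file" 600 timeout "${CLAUDE_TIMEOUT:-7200}" claude "${args[@]}", and an exit status of 75 would mean the lock was never acquired rather than that claude itself failed.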

Affected files

  • lib/agent-sdk.sh (agent_run, lines 51 and 82 — add flock around claude -p calls)

Acceptance criteria

  • Only one claude -p process runs at a time per container
  • A second agent waits for the lock rather than crashing
  • Lock file is ~/.claude/session.lock (same path as the old tmux lock for consistency)
  • No empty-output crashes when dev and review overlap
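The first two criteria could be checked without invoking claude at all. The sketch below is an assumption-laden stand-in: locked_worker and events_are_serialized are hypothetical names, and a one-second sleep takes the place of a claude -p session. Each worker logs a start and an end line while holding the lock, so a correctly serialized run always alternates start and end.

```shell
#!/usr/bin/env bash
# Hypothetical overlap check; the sleep stands in for a claude -p session.

locked_worker() {
  local lock="$1" log="$2" name="$3"
  # Everything inside sh -c runs while the exclusive lock is held.
  flock -w 30 "$lock" sh -c '
    echo "start $1" >>"$2"
    sleep 1
    echo "end $1" >>"$2"
  ' _ "$name" "$log"
}

events_are_serialized() {
  # Succeeds only if the log strictly alternates start/end lines;
  # two consecutive "start" lines mean two sessions overlapped.
  awk '{ if ($1 == prev) bad = 1; prev = $1 } END { exit bad + 0 }' "$1"
}
```

Starting two locked_worker calls in the background against the same lock file, waiting for both, and then running events_are_serialized on the log gives a quick pass/fail signal for the locking behaviour.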
dev-bot added the backlog and priority labels 2026-04-05 20:47:05 +00:00
dev-qwen self-assigned this 2026-04-05 20:52:40 +00:00
dev-qwen added the in-progress label and removed the backlog label 2026-04-05 20:52:40 +00:00
dev-qwen was unassigned by dev-bot 2026-04-05 20:59:02 +00:00
dev-bot removed the in-progress label 2026-04-05 20:59:02 +00:00
Reference: disinto-admin/disinto#261