Commit graph

168 commits

Author SHA1 Message Date
openhands
bda9240268 refactor: extract ensure_blocked_label_id to lib/ci-helpers.sh (#352)
Move ensure_blocked_label_id() from dev/phase-handler.sh into
lib/ci-helpers.sh to eliminate the duplicate blocked-label creation
curl block that existed in both phase-handler.sh and dev-poll.sh.

Both dev-agent.sh and action-agent.sh now source lib/ci-helpers.sh
so the function is available when phase-handler.sh calls it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 05:06:12 +00:00
openhands
61c44d31b1 fix: refactor: replace escalation JSONL with blocked label + diagnostic comment (#352)
Replace the unreliable escalation JSONL system (supervisor/escalations-*.jsonl
consumed by gardener) with direct blocked label + diagnostic comment on the
original issue.

When a dev-agent or action-agent session fails (PHASE:failed, idle timeout,
crash, CI exhausted):
- Capture last 50 lines from tmux pane via tmux capture-pane
- Post a structured diagnostic comment on the issue (exit reason, timestamp,
  PR number, tmux output)
- Label the issue "blocked" (instead of restoring "backlog")
- Remove in-progress label

Removed:
- Escalation JSONL write paths in dev-agent.sh, phase-handler.sh, dev-poll.sh,
  action-agent.sh
- is_escalated() helper in dev-poll.sh
- Escalation triage (P2f section) in supervisor-poll.sh
- Escalation processing + recipe engine in gardener-poll.sh
- ci-escalation-recipes step from run-gardener.toml formula
- escalations*.jsonl from .gitignore

Added:
- post_blocked_diagnostic() shared helper in phase-handler.sh
- ensure_blocked_label_id() helper (creates label via API if not exists)
- is_blocked() helper in dev-poll.sh (replaces is_escalated)
- Blocked issues listing in supervisor/preflight.sh

Kept:
- Matrix notifications on failure (unchanged)
- CI fix counter logic (still tracks attempts)
- needs_human injection in supervisor/gardener (not escalation-related)
- Gardener grooming (gardener-agent.sh still invoked)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 04:18:43 +00:00
openhands
ab122c9701 fix: PHASE:needs_human missing from crash-path terminal set in monitor_phase_loop (#342)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 03:50:21 +00:00
openhands
a1d47a20f2 fix: eliminate duplicate code blocks flagged by CI dup-detection
Use single-line conditionals for worktree check in PHASE:crashed handler
(phase-handler.sh) to break 5-line window match with idle_timeout case.
Slim dev-agent.sh crashed case to just restore_to_backlog since the
_on_phase_change callback handles full cleanup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 03:27:35 +00:00
openhands
7156f21e12 fix: extract restore_to_backlog() to eliminate duplicate label reset pattern
The cleanup_labels + curl POST + CLAIMED=false pattern was duplicated
across dev-agent.sh (idle_timeout and crashed cases) and phase-handler.sh
(PHASE:crashed handler), triggering duplicate-detection CI failure.

Extract restore_to_backlog() shared helper; call it from all three sites.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-21 02:14:47 +00:00
openhands
7f9cefa847 fix: PHASE:crashed unhandled in _on_phase_change / dev-agent callback (#339)
Add explicit PHASE:crashed case to _on_phase_change in phase-handler.sh:
logs crash, notifies Matrix, escalates to supervisor, restores backlog
label, preserves worktree if PR exists, cleans up temp files.

Add crashed case to dev-agent.sh post-loop case statement for
belt-and-suspenders cleanup matching the callback behavior.

Replaces the dead crash_recovery_failed case that was never triggered.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 01:31:20 +00:00
openhands
e7be534c7d fix: Inner CI/review wait loops bypass exit_marker fast-path (#338)
Add exit_marker file check to the CI wait loop and review wait loop in
phase-handler.sh, matching the pattern already used in monitor_phase_loop
(agent-session.sh). This makes crash detection consistent across all
polling paths.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 00:55:38 +00:00
openhands
e5965e71d4 fix: Stale REQUEST_CHANGES reviews still trigger re-work (#336)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 00:05:09 +00:00
openhands
e3895ad3ac fix: feat: SessionStart compact hook re-injects phase protocol after context compaction (#274)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 23:27:32 +00:00
openhands
f78fbc1da6 fix: bundled dust cleanup — lib/matrix_listener.sh (#264)
- Remove dead ROOM_ENCODED and EVENT_ID variables from matrix_listener.sh
  (were suppressed with SC2034 instead of removed)
- Remove dead REPO variable from dev-poll.sh and review-poll.sh
- Update header comment in matrix_listener.sh to list all 5 reply-routing
  cases (supervisor, gardener, dev, review, vault, action)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 21:40:31 +00:00
openhands
6405ac9837 fix: use shared scratch helpers in dev-agent and action-agent to eliminate duplicates (#262)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 20:47:22 +00:00
openhands
7199bbf9b5 fix: feat: agents flush context to scratch file before compaction (#262)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 20:12:45 +00:00
openhands
3a1df8f233 fix: dev-poll.sh has no explicit guard for action-labeled issues (#233)
Add skip guards for `action`, `prediction/backlog`, and `prediction/unreviewed`
labels in both the orphan scan and backlog scan, matching the existing `formula`
guard pattern. Issues with these labels will no longer be picked up by dev-agent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 17:15:03 +00:00
johba
a15623b747 fix: fix: action-agent shares phase handler with dev-agent — review lifecycle + cleanup (#388) (#403)
Fixes #388

## Changes
Action-agent now sources dev/phase-handler.sh and enters monitor_phase_loop after prompt injection. Two paths: (A) git output triggers the same PR/CI/review lifecycle as dev-agent, (B) no-git output writes PHASE:done for cleanup. Adds docker compose down on terminal phases, escalation to supervisor on idle timeout, and proper temp file cleanup.

Co-authored-by: openhands <openhands@all-hands.dev>
Reviewed-on: https://codeberg.org/johba/disinto/pulls/403
Reviewed-by: Disinto_bot <disinto_bot@noreply.codeberg.org>
2026-03-20 17:39:44 +01:00
openhands
def1ba7814 fix: use numeric IN_PROGRESS_LABEL_ID in DELETE calls (cleanup_labels and cleanup)
Review caught that cleanup_labels() and cleanup() still used the
string name 'in-progress' in DELETE /labels/ URL paths. Switched
both to use ${IN_PROGRESS_LABEL_ID} so label removal actually works
on abort/crash.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 14:45:03 +00:00
openhands
613c6c12cb fix: dev/dev-agent.sh:334 — 'in-progress' label still passed as string name to POST /labels (#222)
Look up IN_PROGRESS_LABEL_ID via the labels API (with hardcoded
fallback) and pass the numeric ID to POST /issues/{id}/labels,
matching the pattern already used for BACKLOG_LABEL_ID.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 14:25:00 +00:00
openhands
99a68d3ef5 fix: DELETE /issues/{id}/labels/backlog uses label name not numeric ID (#214)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 14:09:44 +00:00
johba
e47b1967c4 fix: fix: phase handler CI poll uses stale SHA — re-fetch worktree HEAD each cycle (#370) (#380)
Fixes #370

## Changes
Re-fetch CI_CURRENT_SHA from worktree HEAD on each CI poll cycle inside the awaiting_ci handler. Previously the SHA was captured once before the loop, causing stale-SHA polling when Claude pushed new commits mid-wait.

Co-authored-by: openhands <openhands@all-hands.dev>
Reviewed-on: https://codeberg.org/johba/disinto/pulls/380
Reviewed-by: Disinto_bot <disinto_bot@noreply.codeberg.org>
2026-03-20 14:29:57 +01:00
openhands
1d797c0303 fix: address review — guard before CI counter, cover all spawn points
- Move tmux session guard BEFORE handle_ci_exhaustion in both CI-fix
  paths so poll cycles with an active session don't waste fix attempts
- Add tmux guards to recovery spawn (orphan, no PR) and both
  agent-merge fallback paths (orphan + stuck-PR)
- Use continue instead of exit 0 when guard fires in stuck-PR loop
  so remaining PRs are still checked

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 11:45:19 +00:00
openhands
4feb1fba97 fix: dev-poll spawns duplicate agents — no tmux session guard (#371)
Add tmux has-session check before spawning dev-agent.sh at all four
spawn points (orphan REQUEST_CHANGES, orphan CI fix, stuck-PR
REQUEST_CHANGES, stuck-PR CI fix). If a tmux session already exists
for the issue, log and skip instead of spawning a duplicate agent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 11:45:19 +00:00
openhands
a58aef90d3 fix: too_large branch still uses string label '"underspecified"' (#213)
Look up UNDERSPECIFIED_LABEL_ID via the Gitea labels API (with fallback)
and use the numeric ID in both phase-handler.sh (PHASE:failed/too_large)
and dev-poll.sh (preflight too_large), matching the pattern already used
for BACKLOG_LABEL_ID.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 09:50:20 +00:00
openhands
6f30614dda fix: fix: guard blocks merge injection — Claude closes issue without merging (#344)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 07:37:32 +00:00
johba
b78b22d830 fix: fix: dev-poll backlog selection is LIFO — should be FIFO (#349) (#350)
Fixes #349

## Changes
Add &sort=oldest to the backlog API call in dev/dev-poll.sh (line 401) so issues are picked FIFO instead of the Gitea default LIFO order.

Co-authored-by: openhands <openhands@all-hands.dev>
Reviewed-on: https://codeberg.org/johba/disinto/pulls/350
Reviewed-by: Disinto_bot <disinto_bot@noreply.codeberg.org>
2026-03-20 08:33:03 +01:00
openhands
f66fcd666c fix: address review — terminal phase guard, explicit marker var, test coverage
- Guard against overwriting terminal phases (PHASE:done, PHASE:merged)
  in on-stop-failure.sh to prevent false failures from same-turn race
- Declare sf_phase_marker explicitly in StopFailure block instead of
  relying on phase_marker from PostToolUse block
- Add authentication_failed test (10c) and terminal phase guard tests
  (10g, 10h)
- Fix fragile nested command substitution in test 10f fail() message

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 01:52:46 +00:00
openhands
eaf2841494 fix: feat: StopFailure hook writes phase file on API error / rate limit (#275)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 01:43:00 +00:00
openhands
70aea63521 fix: Dual curl calls for HAS_APPROVE / HAS_CHANGES create a race window (#321)
Each of the three review-check sites (orphan, stuck-PR, backlog) now
fetches reviews with a single curl call, storing the JSON response and
jq-filtering both HAS_APPROVE and HAS_CHANGES from the cached result.
This eliminates the race window where a review submitted between the
two calls could cause a transient mismatch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 00:59:51 +00:00
openhands
08d702b055 fix: fix: stale REQUEST_CHANGES reviews are invisible to dev-poll stuck-PR check (#319)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-19 22:24:48 +00:00
openhands
eeb8d5450f fix: agents don't clean up tmux sessions and phase files on completion (#302)
review-pr.sh: After APPROVE verdict, kill tmux session, remove phase
file, review output, sentinel files, and review worktree. Same cleanup
for unknown verdicts. REQUEST_CHANGES keeps session alive per #300.

review-poll.sh: Add safety net in stale session cleanup loop — kill
sessions in terminal phase (PHASE:review_complete) even if review-pr.sh
cleanup was interrupted.

dev/phase-handler.sh: Add sentinel file cleanup (/tmp/ci-result-*,
/tmp/review-injected-*) to PHASE:done and PHASE:failed handlers.

dev-agent.sh: Add sentinel file cleanup to idle_timeout/idle_prompt
exit handler. Add belt-and-suspenders done) case to post-loop handler.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-19 20:30:27 +00:00
openhands
809dd93c3b fix: distinguish phase file writes from reads in PostToolUse hook
- Parse tool_name via jq: Write tool checks file_path match,
  Bash tool checks for redirect operator (>) with phase file path
- Reads (cat, head) no longer trigger false-positive markers
- Split guard into separate statements for clarity
- Move marker cleanup inside hook-install guard
- Expand tests: 5 cases covering Bash write, Write tool, Bash read,
  unrelated Bash, and Write to different file

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-19 18:14:49 +00:00
openhands
ac04dc29a6 fix: feat: PostToolUse hook detects phase file writes in real-time (eliminates polling latency) (#278)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-19 17:55:06 +00:00
openhands
1ab700c87a fix: feat: review + dev-poll skip CI gate for non-code PRs (#266)
Add diff_has_code_files() and ci_required_for_pr() helpers to
ci-helpers.sh. Non-code PRs (docs/*, formulas/*, evidence/*, *.md)
that have no CI results now skip the CI gate instead of being stuck
forever.

Applied to:
- review-pr.sh: CI gate skipped for non-code PRs
- review-poll.sh: CI gate skipped for non-code PRs
- dev-poll.sh: CI state treated as "success" for non-code PRs in
  orphan, stuck-PR, and backlog merge paths

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-19 13:48:00 +00:00
johba
d5c2c213a3 fix: bug: gardener hangs forever when Claude finishes without writing phase file (#261) (#263)
Fixes #261

## Changes
Fixed gardener hanging forever when Claude skips phase protocol. Three changes: (1) gardener-agent.sh: replaced 999999s timeout with 7200s (2h, matching dev-agent); (2) lib/agent-session.sh: added idle-prompt detection to monitor_phase_loop — if Claude returns to the ❯ prompt for 3 consecutive polls with no phase file written, exits immediately with _MONITOR_LOOP_EXIT=idle_prompt (only fires when phase file is empty, so awaiting_ci/review waits are unaffected); (3) gardener prompt: removed 'no time limit' wording, replaced with explicit phase-write requirement.

Co-authored-by: openhands <openhands@all-hands.dev>
Reviewed-on: https://codeberg.org/johba/disinto/pulls/263
Reviewed-by: Disinto_bot <disinto_bot@noreply.codeberg.org>
2026-03-19 13:47:10 +01:00
openhands
7cf1d035e0 fix: use original issue body for dep parsing and PR recovery detection
Prevent human comments appended to ISSUE_BODY from causing false
positive dependency blocks or spurious 'Existing PR:' recovery matches
in parse-deps.sh and the PR recovery guard.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-19 08:25:02 +00:00
openhands
d40a9c36c5 fix: dev-agent reads issue comments alongside body (#237)
Fetches issue comments via Codeberg API and appends human comments
to the issue body in the Claude prompt. Bot comments (Disinto_bot,
disinto-factory) are filtered out.

One API call, zero new dependencies.
2026-03-19 07:56:11 +00:00
openhands
54fa568935 fix: dev/dev-agent.sh still passes string label name to /labels replace endpoint (#202)
Look up the backlog label ID via the Gitea labels API (with fallback to
1300815) and replace '{"labels":["backlog"]}' with the integer ID form
at both call sites (cleanup() line 135 and idle_timeout handler line 713).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-19 00:45:57 +00:00
openhands
08b65ddc43 fix: dev/phase-handler.sh still passes string label name to /labels replace endpoint (#203)
Look up backlog label ID via Codeberg API at the start of the PHASE:failed
branch and replace '{"labels":["backlog"]}' at lines 547 and 628 with
the numeric ID, matching the pattern already used in gardener.
2026-03-18 22:06:05 +00:00
openhands
d1cea6c0bb fix: apply same REQUEST_CHANGES/CI-pending fix to PRIORITY 1 block 2026-03-18 21:03:53 +00:00
openhands
34ddbef3fd fix: PRIORITY 1.5 misses REQUEST_CHANGES when CI is not yet settled (#41) 2026-03-18 20:50:56 +00:00
openhands
bb2af8db10 fix: address review feedback — set -e bug, sentinel path, fragile grep, stale comment (#171)
- Fix set -e bug: use `_merge_rc=0; do_merge ... || _merge_rc=$?` so non-zero
  returns don't kill the agent before _merge_rc is captured
- Fix sentinel path: skip sentinel break for APPROVE so do_merge() always runs,
  even when review-poll.sh injected the verdict first
- Fix fragile grep: match HTTP 405 alone instead of `grep -qi "not enough"` —
  any 405 from the merge endpoint is a structural block (approvals, branch
  protection), not a transient error
- Fix stale comment/status in PHASE:done handler: "orchestrator or Claude"
  instead of "agent"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-18 19:53:26 +00:00
openhands
374fe2b2b4 fix: fix: dev-agent merge failure on "not enough approvals" should escalate immediately (#171)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-18 19:45:13 +00:00
openhands
f73d5f471e fix: feat: dev-agent merges its own PRs via non-admin Codeberg account (#172)
- phase-handler.sh: remove do_merge(); on APPROVAL inject exact API
  commands for agent to merge+close directly; PHASE:done now only
  does local cleanup (tmux, worktree, labels) — merge already done
- dev-agent.sh: update PHASE_PROTOCOL_INSTRUCTIONS — Approved means
  merge via API, close issue, then write PHASE:done
- dev-poll.sh: remove try_merge_or_rebase(); for approved+CI-green
  orphaned PRs, spawn dev-agent (recovery mode) to merge instead
- .env.example: document new token roles (CODEBERG_TOKEN = bot for
  push/PR/merge; REVIEW_BOT_TOKEN = human account for approvals)
- AGENTS.md: update token descriptions to match new roles

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-18 17:59:36 +00:00
openhands
42fa8f48e0 fix: restore log, notify, notify_ctx functions to dev-agent.sh
Lost during #160 refactor. These are dev-agent specific (reference
$ISSUE, $THREAD_FILE, $LOGFILE) so they belong in the agent script,
not the shared library.
2026-03-18 16:37:55 +00:00
openhands
d83098f382 fix: pass SESSION_NAME to all agent-session.sh function calls
Library functions need explicit session name argument — they no longer
have closure over $SESSION_NAME from the parent script.

- agent_kill_session: add $SESSION_NAME to all 11 call sites
- agent_inject_into_session: add $SESSION_NAME to all call sites in
  phase-handler.sh and gardener-agent.sh
- agent_kill_session: guard against missing arg (defensive)
2026-03-18 16:24:58 +00:00
openhands
ae3e742f9f fix: rename function calls to match agent-session.sh exports (#176)
kill_tmux_session → agent_kill_session
inject_into_session → agent_inject_into_session
wait_for_claude_ready → agent_wait_for_claude_ready

Also restore status() function lost during #160 refactor.

Fixes dev-agent and gardener-agent crash on startup:
  line 149: status: command not found
  line 280: kill_tmux_session: command not found
2026-03-18 16:10:12 +00:00
johba
d27f6bcb99 fix: refactor: slim dev-agent.sh to use lib/agent-session.sh (#160) (#173)
Fixes #160

## Changes
Extracted phase callback functions (post_refusal_comment, do_merge, _on_phase_change) from dev/dev-agent.sh into new dev/phase-handler.sh. dev-agent.sh now sources both lib/agent-session.sh and dev/phase-handler.sh. Replaced inline dependency extraction with lib/parse-deps.sh. dev-agent.sh reduced from 1516 to 684 lines (55% reduction). AGENTS.md shellcheck command updated to include the new files.

Co-authored-by: openhands <openhands@all-hands.dev>
Reviewed-on: https://codeberg.org/johba/disinto/pulls/173
Reviewed-by: Disinto_bot <disinto_bot@noreply.codeberg.org>
2026-03-18 16:52:14 +01:00
openhands
cbd8c81da8 refactor: extract lib/agent-session.sh — reusable tmux + Claude agent runtime (#158)
Move generic agent infrastructure from dev/dev-agent.sh into lib/agent-session.sh:
- log, status, notify, notify_ctx, read_phase, wait_for_claude_ready,
  inject_into_session, kill_tmux_session extracted verbatim
- create_agent_session(session_name, workdir) — new: tmux session creation
- inject_formula(session_name, formula_text, context) — new: prompt injection
- monitor_phase_loop(phase_file, idle_timeout, callback_fn) — new: phase loop
  with session health check, crash recovery, and idle timeout detection

dev-agent.sh: sources the library, implements _on_phase_change() callback,
calls monitor_phase_loop(); idle-timeout and crash-recovery-failed cleanup
handled via _MONITOR_LOOP_EXIT signal variable. Behavior unchanged.
2026-03-18 14:36:36 +00:00
openhands
d904192ab7 fix: Escalation write-once guard is not atomic (pre-existing) (#154)
- `ci_fix_check_and_increment` now accepts an optional `check_only` arg:
  - count < 3, check_only: returns `ok:N` without incrementing (deferred
    to launch time, preserving the WAITING_PRS protection)
  - count < 3, non-check_only: increments and returns `ok:N` (unchanged)
  - count == 3 (any mode): atomically bumps to 4 and returns
    `exhausted_first_time:3` — only one concurrent poller can win this
  - count > 3 (any mode): returns `exhausted:N` with no write

- `handle_ci_exhaustion` unified to a single code path for both
  check_only and non-check_only:
  - Writes escalation JSONL + matrix_send only when sentinel is
    `exhausted_first_time` — never on a bare integer comparison outside
    a lock
  - Removes the two separate `ci_fix_increment` bump-to-4 calls that
    were racy (the sentinel bump is now inside the flock in Python)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-18 11:44:30 +00:00
openhands
e088f3e7ae fix: dev-agent CI retrigger sets LAST_PHASE_MTIME equal to touched phase file — main loop never re-enters awaiting_ci (#148)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-18 11:16:05 +00:00
openhands
4dc64ea65b fix: restore deferred increment for backlog path to prevent counter leak
The previous commit introduced a counter leak in the backlog scan path:
handle_ci_exhaustion (without check_only) atomically incremented the CI
fix counter before the WAITING_PRS guard, so an exit 0 that never spawned
a dev-agent would silently consume one of the three allowed fix attempts.

Restore the READY_PR_FOR_INCREMENT / deferred-increment mechanism:
- Backlog scan calls handle_ci_exhaustion with "check_only" (read-only,
  no increment) to detect exhaustion without touching the counter.
- The counter is bumped atomically at LAUNCH time via handle_ci_exhaustion
  (without check_only), so the increment only happens when we are certain
  a dev-agent is being spawned. If a concurrent poller already exhausted
  the counter between scan and launch, the LAUNCH call returns 0 and we
  bail out cleanly without double-spawning.

The in-progress, stuck-PR, and try_merge_or_rebase paths are unaffected:
they call handle_ci_exhaustion without check_only, which continues to use
the atomic ci_fix_check_and_increment to prevent concurrent double-spawning.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-18 10:34:41 +00:00
openhands
7bf13567fd fix: TOCTOU in handle_ci_exhaustion: check-then-act not atomic (#125)
Add ci_fix_check_and_increment() that performs read + threshold-check +
conditional increment in a single flock-protected Python call, replacing
the prior three-step sequence (ci_fix_count / bash check / ci_fix_increment)
that allowed two concurrent poll invocations to both pass the threshold and
spawn duplicate dev-agents for the same PR.

handle_ci_exhaustion now calls ci_fix_check_and_increment atomically and
returns the new count in CI_FIX_ATTEMPTS; all separate ci_fix_increment
calls after handle_ci_exhaustion (including the deferred READY_PR_FOR_INCREMENT
mechanism) are removed. Log messages updated from CI_FIX_ATTEMPTS+1 to
CI_FIX_ATTEMPTS to reflect the post-increment count.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-18 10:22:24 +00:00