Orphan and stuck-PR CI-failure paths in dev-poll.sh called
handle_ci_exhaustion without check_only, incrementing the fix counter on
every poll cycle even when guards (session checks, is_blocked) prevented
an actual agent launch. This could exhaust the 3-attempt budget without
any real fix attempts.
Now both paths use the same two-phase pattern as the backlog scan:
1. check_only during the scan (no counter increment)
2. Increment atomically at actual launch time
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
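The two-phase pattern described in this commit can be sketched roughly as follows. This is an illustration only, not the code in dev-poll.sh: the counter is held in a shell array instead of the real tracker, and `can_attempt_fix`, `record_launch`, and `poll_cycle` are hypothetical names.

```bash
# Hypothetical in-memory stand-in for the per-PR CI fix counter.
declare -A FIX_COUNT=()
MAX_FIX_ATTEMPTS=3

# Phase 1: read-only check during the scan. Never writes the counter.
can_attempt_fix() {
  local pr="$1"
  (( ${FIX_COUNT[$pr]:-0} < MAX_FIX_ATTEMPTS ))
}

# Phase 2: increment only when an agent is really about to be launched.
record_launch() {
  local pr="$1"
  FIX_COUNT[$pr]=$(( ${FIX_COUNT[$pr]:-0} + 1 ))
}

# One poll cycle for one PR. $2 is "yes" when a guard
# (active session, blocked label) prevents the launch.
poll_cycle() {
  local pr="$1" guard_fired="$2"
  can_attempt_fix "$pr" || { echo "exhausted"; return 0; }
  if [[ "$guard_fired" == "yes" ]]; then
    echo "skipped"   # guard fired: counter stays untouched
    return 0
  fi
  record_launch "$pr"
  echo "launched"
}
```

With this shape, any number of guarded cycles leaves the budget intact, and only real launches count toward the three attempts.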
Add ci_failed() helper to lib/ci-helpers.sh and replace three compound
`! ci_passed && CI_STATE != "" && != "pending" && != "unknown"` patterns
in dev/dev-poll.sh with the cleaner ci_failed() call.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
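A minimal sketch of what a helper like ci_failed() might look like, based only on the compound condition quoted above. The argument convention (state passed as `$1`) is an assumption, and `ci_passed` here is a simplified stand-in rather than the real helper.

```bash
# Simplified stand-in: the real ci_passed has additional rules.
ci_passed() {
  [[ "$1" == "success" ]]
}

# True only for a definite failure: CI did not pass AND the state is
# neither empty, "pending", nor "unknown".
ci_failed() {
  local state="$1"
  case "$state" in
    ""|pending|unknown) return 1 ;;
  esac
  ! ci_passed "$state"
}
```

Folding the three exclusions into one `case` keeps each call site down to `if ci_failed "$CI_STATE"; then ...`.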
Move ensure_blocked_label_id() from dev/phase-handler.sh into
lib/ci-helpers.sh to eliminate the duplicate blocked-label creation
curl block that existed in both phase-handler.sh and dev-poll.sh.
Both dev-agent.sh and action-agent.sh now source lib/ci-helpers.sh
so the function is available when phase-handler.sh calls it.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the unreliable escalation JSONL system (supervisor/escalations-*.jsonl
consumed by gardener) with direct blocked label + diagnostic comment on the
original issue.
When a dev-agent or action-agent session fails (PHASE:failed, idle timeout,
crash, CI exhausted):
- Capture last 50 lines from tmux pane via tmux capture-pane
- Post a structured diagnostic comment on the issue (exit reason, timestamp,
PR number, tmux output)
- Label the issue "blocked" (instead of restoring "backlog")
- Remove in-progress label
Removed:
- Escalation JSONL write paths in dev-agent.sh, phase-handler.sh, dev-poll.sh,
action-agent.sh
- is_escalated() helper in dev-poll.sh
- Escalation triage (P2f section) in supervisor-poll.sh
- Escalation processing + recipe engine in gardener-poll.sh
- ci-escalation-recipes step from run-gardener.toml formula
- escalations*.jsonl from .gitignore
Added:
- post_blocked_diagnostic() shared helper in phase-handler.sh
- ensure_blocked_label_id() helper (creates label via API if not exists)
- is_blocked() helper in dev-poll.sh (replaces is_escalated)
- Blocked issues listing in supervisor/preflight.sh
Kept:
- Matrix notifications on failure (unchanged)
- CI fix counter logic (still tracks attempts)
- needs_human injection in supervisor/gardener (not escalation-related)
- Gardener grooming (gardener-agent.sh still invoked)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove dead ROOM_ENCODED and EVENT_ID variables from matrix_listener.sh
(were suppressed with SC2034 instead of removed)
- Remove dead REPO variable from dev-poll.sh and review-poll.sh
- Update header comment in matrix_listener.sh to list all 6 reply-routing
cases (supervisor, gardener, dev, review, vault, action)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add skip guards for `action`, `prediction/backlog`, and `prediction/unreviewed`
labels in both the orphan scan and backlog scan, matching the existing `formula`
guard pattern. Issues with these labels will no longer be picked up by dev-agent.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
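The skip guards from this commit could be expressed along these lines. The label names come from the commit message; the function name and the convention of passing labels as arguments are assumptions for illustration.

```bash
# Labels that mark an issue as not meant for dev-agent.
SKIP_LABELS=(formula action prediction/backlog prediction/unreviewed)

# Usage: should_skip_issue LABEL...
# Returns 0 when any of the issue's labels is a skip label.
should_skip_issue() {
  local label skip
  for label in "$@"; do
    for skip in "${SKIP_LABELS[@]}"; do
      [[ "$label" == "$skip" ]] && return 0
    done
  done
  return 1
}
```

Keeping the labels in one array means the orphan scan and the backlog scan cannot drift apart when a new label is added.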
- Move tmux session guard BEFORE handle_ci_exhaustion in both CI-fix
paths so poll cycles with an active session don't waste fix attempts
- Add tmux guards to recovery spawn (orphan, no PR) and both
agent-merge fallback paths (orphan + stuck-PR)
- Use continue instead of exit 0 when guard fires in stuck-PR loop
so remaining PRs are still checked
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add tmux has-session check before spawning dev-agent.sh at all four
spawn points (orphan REQUEST_CHANGES, orphan CI fix, stuck-PR
REQUEST_CHANGES, stuck-PR CI fix). If a tmux session already exists
for the issue, log and skip instead of spawning a duplicate agent.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
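A rough sketch of the has-session guard described in this commit. The session naming scheme and the function names are hypothetical; only `tmux has-session -t` is taken as given. The function reports its decision instead of launching anything.

```bash
# Hypothetical session-name scheme; the real naming is not shown in this log.
session_name_for() { printf 'dev-%s-%s' "$1" "$2"; }

# True when a tmux session with the given name exists.
session_exists() { tmux has-session -t "$1" 2>/dev/null; }

# Prints the decision; returns 1 when the spawn is skipped.
spawn_unless_running() {
  local project="$1" issue="$2" session
  session="$(session_name_for "$project" "$issue")"
  if session_exists "$session"; then
    echo "skip: session $session already active"
    return 1
  fi
  echo "spawn: dev-agent for issue #$issue"
}
```

Inside a loop over several PRs, the caller would react to the non-zero return with `continue` so the remaining PRs are still checked.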
Look up UNDERSPECIFIED_LABEL_ID via the Gitea labels API (with fallback)
and use the numeric ID in both phase-handler.sh (PHASE:failed/too_large)
and dev-poll.sh (preflight too_large), matching the pattern already used
for BACKLOG_LABEL_ID.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fixes #349
## Changes
Add &sort=oldest to the backlog API call in dev/dev-poll.sh (line 401) so issues are picked FIFO instead of the Gitea default LIFO order.
Co-authored-by: openhands <openhands@all-hands.dev>
Reviewed-on: https://codeberg.org/johba/disinto/pulls/350
Reviewed-by: Disinto_bot <disinto_bot@noreply.codeberg.org>
Each of the three review-check sites (orphan, stuck-PR, backlog) now
fetches reviews with a single curl call, storing the JSON response and
jq-filtering both HAS_APPROVE and HAS_CHANGES from the cached result.
This eliminates the race window where a review submitted between the
two calls could cause a transient mismatch.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
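The single-fetch approach might look roughly like this. The function name is hypothetical, the endpoint in the comment is an assumption about how CODEBERG_API is laid out, and stale or dismissed reviews are ignored for brevity.

```bash
# Derive both flags from ONE cached reviews response (a JSON array).
# Sets HAS_APPROVE and HAS_CHANGES to the number of matching reviews.
review_flags() {
  local reviews_json="$1"
  HAS_APPROVE="$(jq '[.[] | select(.state == "APPROVED")] | length' <<<"$reviews_json")"
  HAS_CHANGES="$(jq '[.[] | select(.state == "REQUEST_CHANGES")] | length' <<<"$reviews_json")"
}

# In a poller the JSON would come from a single request, for example
# (assumed URL layout, not confirmed by the commit message):
#   REVIEWS_JSON="$(curl -sf -H "Authorization: token $CODEBERG_TOKEN" \
#     "$CODEBERG_API/pulls/$PR_NUM/reviews")"
#   review_flags "$REVIEWS_JSON"
```

Because both values are computed from the same snapshot, they cannot disagree about a review that arrives mid-check.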
Add diff_has_code_files() and ci_required_for_pr() helpers to
ci-helpers.sh. Non-code PRs (docs/*, formulas/*, evidence/*, *.md)
that have no CI results now skip the CI gate instead of being stuck
forever.
Applied to:
- review-pr.sh: CI gate skipped for non-code PRs
- review-poll.sh: CI gate skipped for non-code PRs
- dev-poll.sh: CI state treated as "success" for non-code PRs in
orphan, stuck-PR, and backlog merge paths
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
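As an illustration of the non-code check, a helper in the spirit of diff_has_code_files() could classify changed paths as below. The patterns are the ones listed in the commit message; reading the paths from stdin is an assumption about the interface.

```bash
# Reads changed file paths on stdin, one per line.
# Returns 0 if at least one path falls outside the non-code patterns.
diff_has_code_files() {
  local path
  while IFS= read -r path || [[ -n "$path" ]]; do
    [[ -z "$path" ]] && continue
    case "$path" in
      docs/*|formulas/*|evidence/*|*.md) ;;   # non-code: keep looking
      *) return 0 ;;                          # anything else counts as code
    esac
  done
  return 1
}
```

A caller could feed it a changed-file list, for example `git diff --name-only master...HEAD | diff_has_code_files`, and skip the CI gate when it returns non-zero.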
- phase-handler.sh: remove do_merge(); on APPROVAL inject exact API
commands for agent to merge+close directly; PHASE:done now only
does local cleanup (tmux, worktree, labels) — merge already done
- dev-agent.sh: update PHASE_PROTOCOL_INSTRUCTIONS — Approved means
merge via API, close issue, then write PHASE:done
- dev-poll.sh: remove try_merge_or_rebase(); for approved+CI-green
orphaned PRs, spawn dev-agent (recovery mode) to merge instead
- .env.example: document new token roles (CODEBERG_TOKEN = bot for
push/PR/merge; REVIEW_BOT_TOKEN = human account for approvals)
- AGENTS.md: update token descriptions to match new roles
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- `ci_fix_check_and_increment` now accepts an optional `check_only` arg:
- count < 3, check_only: returns `ok:N` without incrementing (deferred
to launch time, preserving the WAITING_PRS protection)
- count < 3, non-check_only: increments and returns `ok:N` (unchanged)
- count == 3 (any mode): atomically bumps to 4 and returns
`exhausted_first_time:3` — only one concurrent poller can win this
- count > 3 (any mode): returns `exhausted:N` with no write
- `handle_ci_exhaustion` unified to a single code path for both
check_only and non-check_only:
- Writes escalation JSONL + matrix_send only when sentinel is
`exhausted_first_time` — never on a bare integer comparison outside
a lock
- Removes the two separate `ci_fix_increment` bump-to-4 calls that
were racy (the sentinel bump is now inside the flock in Python)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
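The four cases above can be sketched in shell as follows. This is not the real implementation, which runs Python inside the flock: here the counter is a plain one-number file (hypothetical), and the `ok:N` value in non-check_only mode is assumed to be the post-increment count.

```bash
# Usage: ci_fix_check_and_increment COUNTER_FILE [check_only]
ci_fix_check_and_increment() {
  local counter_file="$1" mode="${2:-}" count
  (
    flock -x 9                         # exclusive lock for the whole read-modify-write
    count="$(cat "$counter_file" 2>/dev/null)"
    [[ "$count" =~ ^[0-9]+$ ]] || count=0
    if (( count < 3 )); then
      if [[ "$mode" != "check_only" ]]; then
        count=$(( count + 1 ))
        echo "$count" > "$counter_file"
      fi
      echo "ok:$count"
    elif (( count == 3 )); then
      echo 4 > "$counter_file"         # sentinel bump: only one caller sees this branch
      echo "exhausted_first_time:3"
    else
      echo "exhausted:$count"          # no write
    fi
  ) 9>"$counter_file.lock"
}
```

Because the bump from 3 to 4 happens under the lock, two concurrent pollers cannot both observe `exhausted_first_time`, so the one-time notification is sent once.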
The previous commit introduced a counter leak in the backlog scan path:
handle_ci_exhaustion (without check_only) atomically incremented the CI
fix counter before the WAITING_PRS guard, so an exit 0 that never spawned
a dev-agent would silently consume one of the three allowed fix attempts.
Restore the READY_PR_FOR_INCREMENT / deferred-increment mechanism:
- Backlog scan calls handle_ci_exhaustion with "check_only" (read-only,
no increment) to detect exhaustion without touching the counter.
- The counter is bumped atomically at LAUNCH time via handle_ci_exhaustion
(without check_only), so the increment only happens when we are certain
a dev-agent is being spawned. If a concurrent poller already exhausted
the counter between scan and launch, the LAUNCH call returns 0 and we
bail out cleanly without double-spawning.
The in-progress, stuck-PR, and try_merge_or_rebase paths are unaffected:
they call handle_ci_exhaustion without check_only, which continues to use
the atomic ci_fix_check_and_increment to prevent concurrent double-spawning.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add ci_fix_check_and_increment() that performs read + threshold-check +
conditional increment in a single flock-protected Python call, replacing
the prior three-step sequence (ci_fix_count / bash check / ci_fix_increment)
that allowed two concurrent poll invocations to both pass the threshold and
spawn duplicate dev-agents for the same PR.
handle_ci_exhaustion now calls ci_fix_check_and_increment atomically and
returns the new count in CI_FIX_ATTEMPTS; all separate ci_fix_increment
calls after handle_ci_exhaustion (including the deferred READY_PR_FOR_INCREMENT
mechanism) are removed. Log messages updated from CI_FIX_ATTEMPTS+1 to
CI_FIX_ATTEMPTS to reflect the post-increment count.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wrap ci_fix_count(), ci_fix_increment(), and ci_fix_reset() with flock
on a shared lockfile to prevent concurrent modification of the JSON
tracker. Uses flock(1) in command-wrapping mode so each Python process
holds an exclusive lock for the duration of its read-modify-write cycle.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
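For readers unfamiliar with flock(1) in command-wrapping mode, a small sketch follows. The lockfile path and function name are hypothetical, and a POSIX `sh -c` snippet stands in for the Python process that the real helpers wrap.

```bash
LOCKFILE="${TMPDIR:-/tmp}/ci-fixes-example.lock"   # hypothetical path

# flock takes the lock, runs the wrapped command, and releases the lock
# when that command exits, so the read-modify-write cannot interleave.
locked_increment() {
  local counter_file="$1"
  flock "$LOCKFILE" sh -c '
    n=$(cat "$1" 2>/dev/null)
    echo $(( ${n:-0} + 1 )) > "$1"
  ' sh "$counter_file"
}
```

Without the wrapper, two processes could both read the same value and write back the same increment, losing one update.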
Extract CI-exhaustion check/escalate logic into handle_ci_exhaustion() helper.
All three call sites (orphaned PRs, stuck PRs, backlog PRs) now use the shared
function, eliminating future drift between the copies.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix SC2164: add || exit 1 to bare cd in update-prompt.sh
- Fix SC2155: separate declare and assign in env.sh, supervisor-poll.sh, dev-agent.sh
- Fix SC2034: inline suppression for vars used by sourced helpers
- Remove unused `mergeable` declaration, rename unused loop var to `_w`
- Remove || true from shellcheck CI step — failures are now blocking
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- supervisor: skip *.done.jsonl in escalation glob (bug: wildcard matched
harb.done.jsonl producing spurious 'pending' log noise every cycle)
- supervisor: use wc -l instead of grep -c . for line counting (style nit)
- supervisor: consume gardener-esc-resolved.log via fixed() so escalation
resolutions appear in end-of-cycle supervisor reporting
- dev-poll: update all 'escalated to supervisor' log/matrix strings to
'escalated to gardener' (lines 263, 268, 344, 420)
- gardener: track _esc_total_created across all escalation entries and
write count to supervisor/gardener-esc-resolved.log after processing
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- dev-poll.sh: write escalations to per-project files
(supervisor/escalations-{PROJECT_NAME}.jsonl) and add "project" field
so each project's escalations are isolated; update is_escalated() to
read from the same per-project paths
- gardener-poll.sh: add escalation processing block that reads the
per-project escalation file, fetches CI logs via Woodpecker, and
creates per-file ShellCheck sub-issues or generic CI failure issues
labeled backlog — runs with the correct CODEBERG_API and
WOODPECKER_REPO_ID already loaded from the project TOML
- supervisor-poll.sh: remove the escalation processing block; replace
with a simple flog report counting pending escalations per project
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Race condition: mv escalations.jsonl to a PID-stamped snapshot before
processing so concurrent dev-poll appends go to a fresh file; rm snapshot
after loop — no entries are ever silently dropped
- SQL injection: validate ESC_PR_SHA is a 40-char hex string before
interpolating into the wpdb query
- sc_codes scope: compute per-file from file_errors (already filtered to
that file) instead of the entire step log; also switch grep to -F so
dots in filenames are not treated as regex wildcards
- step_pid validation: reject non-integer values from Woodpecker API before
passing as CLI argument
- Fallback body now distinguishes "CI logs unavailable" from "logs found
but issue creation API calls failed"
- ESC_GENERIC_FAIL: avoid leading blank line by using conditional separator
and fix code-block opening newline
- is_escalated(): remove dead esc_file/done_file locals; add Python-level
int() guard so empty/non-numeric issue or pr values fail cleanly instead
of producing a syntax error suppressed by 2>/dev/null
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
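The input-validation fixes in this commit could be sketched as below. The function names are hypothetical; the checks themselves (40-character hex SHA, integer-only step id, fixed-string grep) follow the commit message.

```bash
# True only for a full 40-character hexadecimal commit SHA.
is_full_sha() {
  local re='^[0-9a-fA-F]{40}$'
  [[ "$1" =~ $re ]]
}

# True only for a non-empty string of decimal digits.
is_integer() {
  local re='^[0-9]+$'
  [[ "$1" =~ $re ]]
}

# Filters stdin to lines mentioning the given file. -F treats the name as a
# literal string, so the dot in "env.sh" does not match arbitrary characters.
errors_for_file() {
  grep -F -- "$1" || true
}
```

Rejecting a value before it is interpolated into a query or passed as a CLI argument is simpler and safer than trying to escape it afterwards.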
- supervisor-poll.sh: replace P3 escalation log with actionable sub-issue creation.
For each entry in escalations.jsonl: fetch CI logs via woodpecker-cli, create one
sub-issue per file for ShellCheck failures, one combined issue for other CI failures,
or a fallback investigation issue if logs are unavailable. Move processed entries to
escalations.done.jsonl and clear escalations.jsonl.
- dev-poll.sh: add is_escalated() helper that checks both escalations.jsonl and
escalations.done.jsonl; use it (alongside ci_fix_count >= 3) in all three CI-fix
spawn paths so escalated PRs are skipped even if the ci-fixes tracker is reset.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two bugs after #53 merged:
1. Escalation written every poll cycle (4 entries in 30min) — now writes once, bumps counter to 4 to skip
2. Exit after escalation blocked backlog work — now falls through to pick up next issue
Co-authored-by: openhands <openhands@all-hands.dev>
Reviewed-on: https://codeberg.org/johba/disinto/pulls/59
Reviewed-by: review_bot <review_bot@noreply.codeberg.org>
Dev-poll spawned a fresh agent every 10min for CI failures. Each agent started with CI_FIX_COUNT=0, so the retry loop never terminated.
Now tracks attempts per PR in `/tmp/dev-poll-ci-fixes-{project}.json`. After 3 failed rounds:
- Writes escalation to `supervisor/escalations.jsonl`
- Sends Matrix alert
- Stops respawning
Part of #52 (supervisor escalation pipeline).
Co-authored-by: openhands <openhands@all-hands.dev>
Reviewed-on: https://codeberg.org/johba/disinto/pulls/53
Reviewed-by: review_bot <review_bot@noreply.codeberg.org>
dev-poll.sh had 5 places checking CI_STATE='success', all blocking
projects without CI. Extracted ci_passed() helper that treats
empty/pending/unknown as pass when WOODPECKER_REPO_ID=0.
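A minimal sketch of a ci_passed() helper with the behavior described above. Passing the state as `$1` is an assumption about the interface; the real helper may read CI_STATE directly.

```bash
# "success" always passes. For projects without CI (WOODPECKER_REPO_ID=0),
# an empty, pending, or unknown state is also treated as a pass.
ci_passed() {
  local state="$1"
  [[ "$state" == "success" ]] && return 0
  if [[ "${WOODPECKER_REPO_ID:-0}" == "0" ]]; then
    case "$state" in
      ""|pending|unknown) return 0 ;;
    esac
  fi
  return 1
}
```

Call sites then shrink from a literal comparison against `success` to `if ci_passed "$CI_STATE"; then ...`.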
Don't start new issues while open PRs are waiting for review/CI.
This prevents dev-agent from churning through backlog issues
without reviews landing first.
Hardcoded /tmp/dev-agent.lock meant harb and disinto dev-polls shared
a lock — one project's running agent blocked the other. Now uses
/tmp/dev-agent-{project}.lock and dev-agent-{project}.log.
Replace lib/parse-deps.py with lib/parse-deps.sh to keep the toolchain
all-bash. Rewrite supervisor P3b cycle detection and P3c stale dep check
as pure bash using associative arrays and DFS.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
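Cycle detection with associative arrays and DFS, as mentioned in this commit, can be sketched in pure bash like this. The data layout (`DEPS` mapping an issue number to a space-separated dependency list) and all names are assumptions, not the supervisor's actual code.

```bash
declare -A DEPS=()    # issue -> space-separated issues it depends on
declare -A STATE=()   # unset = unvisited, 1 = on the DFS stack, 2 = finished

# Returns 0 (and prints the closing edge) if a cycle is reachable from $1.
has_cycle_from() {
  local node="$1" dep
  STATE[$node]=1
  for dep in ${DEPS[$node]:-}; do       # intentional word splitting
    case "${STATE[$dep]:-0}" in
      1) echo "cycle: $node -> $dep"; return 0 ;;   # back edge
      0) has_cycle_from "$dep" && return 0 ;;
    esac
  done
  STATE[$node]=2
  return 1
}

# Returns 0 if any dependency cycle exists in DEPS.
find_cycles() {
  local node
  STATE=()
  for node in "${!DEPS[@]}"; do
    [[ "${STATE[$node]:-0}" == 0 ]] && has_cycle_from "$node" && return 0
  done
  return 1
}
```

For example, after `DEPS=([1]="2" [2]="3" [3]="1")`, `find_cycles` returns 0 and prints one edge of the cycle.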
Single source of truth for dependency parsing, replacing three copies:
- dev-poll.sh get_deps() now calls parse-deps.py
- supervisor P3b/P3c import parse_deps() via importlib
Supports stdin, argument, and --json modes for different callers.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The supervisor agent was confusingly named "factory" (same as the
project). Rename directory, script, log, lock, status, and escalation
files. Update all references across scripts and docs.
FACTORY_ROOT env var unchanged (refers to project root, not agent).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Codeberg's mergeable field flickers between true/false — unreliable
for deciding whether to rebase. Just attempt rebase on any non-200/204.
Worst case it's a no-op. Also added git fetch before rebase.
When merge returns non-200, check mergeable flag. If false,
rebase the PR branch onto master via worktree. If rebase fails,
spawn dev-agent to resolve. Prevents infinite 405 retry loops.
Extracted try_merge_or_rebase() helper used at all 3 merge points.
Add matrix_send() to lib/env.sh and matrix_listener.sh daemon for
real-time notifications, threaded escalations, and human-in-the-loop
replies. All agents now notify via Matrix instead of openclaw.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PRs #684 and #710 had no issue number in branch name or title.
Now also checks PR body for 'Closes #NNN'. If still no issue found,
logs a skip (dev-agent requires an issue number to work).
PRs with custom branch names (fix/fitness-factory-address,
chore/seed-consolidation) were invisible to priority 1.5.
Now also extracts issue number from PR title (#NNN) as fallback.
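The fallback chain from the two commits above (branch name, then PR title, then `Closes #NNN` in the body) might look like this. The branch pattern `issue-NNN` is an assumed convention, and the function name is hypothetical.

```bash
# Prints the issue number and returns 0, or returns 1 if none is found.
extract_issue_number() {
  local branch="$1" title="$2" body="$3"
  local re_branch='issue-([0-9]+)'        # assumed branch naming convention
  local re_title='#([0-9]+)'
  local re_body='[Cc]loses #([0-9]+)'
  if [[ "$branch" =~ $re_branch ]]; then
    echo "${BASH_REMATCH[1]}"
  elif [[ "$title" =~ $re_title ]]; then
    echo "${BASH_REMATCH[1]}"
  elif [[ "$body" =~ $re_body ]]; then
    echo "${BASH_REMATCH[1]}"
  else
    return 1                              # caller logs a skip
  fi
}
```

A non-zero return corresponds to the logged skip, since dev-agent needs an issue number to work.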
1. PRIORITY 1.5 in dev-poll: scan ALL open PRs for REQUEST_CHANGES or CI
failure before picking new backlog issues. Stuck PRs get fixed first
to avoid complex rebases piling up.
2. STATE.md written in worktree before claude starts (included in first
commit, not a separate push that dismisses stale approvals).
3. Removed HTTP 405 from merge success check in dev-poll.sh (was fixed
in dev-agent.sh but not here — 2 occurrences).
The merged-PR search was over-engineered and caused false negatives
(couldn't match PR to issue when title/body didn't contain #NNN).
Issue closed = dep satisfied. Factory only closes after merging.