incident: PR #872 sat abandoned 2h22m after legit CI fail — supervisor needs an unblock-PR sweeper (2026-04-16) #894
Reference: disinto-admin/disinto#894
Incident
2026-04-16 13:22–16:17 UTC. dev-qwen claimed issue #850 (compose generator dup-detection) and opened PR #872 with a working lib/generators.sh change, a passing unit test (tests/test-duplicate-service-detection.sh), and a new section 8/8 in tests/smoke-init.sh. The smoke-test addition was structurally broken: it re-invokes bin/disinto init after sections 1–7 have already materialized docker-compose.yml, so _generate_compose_impl hits its early return at lib/generators.sh:298 ("Compose: already exists, skipping") and never reaches the new dup-check. The grep for Duplicate service name 'agents-llama' in step-8 output finds nothing → FAIL.

Blast radius: dev-qwen marked the issue ci_exhausted at 13:55:33, 38 seconds after the failing pipeline #1001 finished at 13:55:20, and moved on. Issue and PR sat stuck for 2h 22m until an operator (johba, via Claude) manually diagnosed, harvested the diff, closed the PR with prior-art reference, and returned the issue to the pool with a one-line fix suggestion (rm -f docker-compose.yml before the step-8 init). Nothing in the supervisor loop noticed.

Why this stayed stuck
This is not the #867 pattern (infra flake burning budget). The CI failure here was legitimate and reproducible: the test as written cannot pass against the code as written. dev-qwen's pr-lifecycle budget exhausted correctly; the agent self-reported ci_exhausted correctly. The gap is downstream: once an issue lands in blocked: ci_exhausted with an open PR, no supervisor probe attempts to unblock it. It just sits.

The manual recovery this time was four moves:
All four are mechanical given the failing-step text. None require the model to re-attempt the work — only to triage and re-queue.
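To make "the test as written cannot pass against the code as written" concrete, the early return can be modeled in a few lines. This is a sketch under stated assumptions: generate_compose_sketch is a hypothetical stand-in whose behavior is inferred from the incident description above, not copied from lib/generators.sh.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for _generate_compose_impl. Only the two behaviors
# named in the incident are modeled: the "already exists" early return and
# the duplicate-service check that sits after it.
#   $1      = path of the compose file to generate
#   $2...   = service names
generate_compose_sketch() {
  local out="$1"; shift
  if [ -e "$out" ]; then
    # Early return: nothing below this point runs when the file exists.
    echo "Compose: already exists, skipping"
    return 0
  fi
  local seen=" " svc
  for svc in "$@"; do
    case "$seen" in
      *" $svc "*)
        echo "Duplicate service name '$svc'" >&2
        return 1
        ;;
    esac
    seen="$seen$svc "
  done
  printf 'services: %s\n' "$*" > "$out"
}
```

Calling it twice against the same path with a duplicated service name prints only the "already exists" line on the second call; removing the file first (the rm -f docker-compose.yml suggestion) lets the duplicate check fire.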
Suggestions — what supervisor can check every 20min
Supervisor already runs every 1200s. Add an abandoned-PR sweeper before any other probe:

1. Find abandoned PRs.
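A minimal sketch of the candidate test for this step, assuming the PR body, the issue's labels (one per line), the latest pipeline status, and the minutes since the last human-authored commit have already been fetched. The function names linked_issue and is_sweep_candidate are hypothetical, and no Forgejo or Woodpecker API call is shown.

```shell
#!/usr/bin/env bash
# Sketch only: both helpers are hypothetical and operate on values the
# caller has already fetched.

# Print the first issue number referenced as "Fixes #N" or "Closes #N".
# Prints nothing (and returns non-zero) when the body has no such reference.
linked_issue() {
  local body="$1"
  printf '%s\n' "$body" \
    | grep -oiE '(fixes|closes) #[0-9]+' \
    | head -n 1 \
    | grep -oE '[0-9]+'
}

# Return 0 when the PR qualifies for the sweep.
#   $1 = issue labels, one per line
#   $2 = latest pipeline status for the PR
#   $3 = minutes since the last human-authored commit (assumption: the
#        caller passes a large number when there is no such commit)
is_sweep_candidate() {
  local labels="$1" status="$2" idle_min="$3"
  printf '%s\n' "$labels" | grep -qxF 'blocked: ci_exhausted' || return 1
  [ "$status" = "failure" ] || return 1
  [ "$idle_min" -ge 30 ]
}
```

Keeping the two checks as pure functions means the sweeper's selection rule can be exercised without a live forge.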
For each open PR, check the linked issue (parse Fixes #N / Closes #N from the body). If the issue carries label blocked: ci_exhausted and the PR's latest pipeline status is failure and the PR has no human-author commits in the last 30min → candidate for sweep.

2. Harvest the failing step.
Extract the last 50 lines of the failing step. Feed them + the PR diff into a small triage prompt:
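A minimal sketch of the input assembly for this step, assuming the failing step's log and the PR diff have already been saved to local files. build_triage_input and the section headers it prints are hypothetical; the triage prompt wording itself is not shown.

```shell
#!/usr/bin/env bash
# Sketch only: build_triage_input is a hypothetical helper.
#   $1 = file holding the failing step's full log
#   $2 = file holding the PR diff
# Prints the last 50 log lines followed by the diff, each under a header,
# ready to be appended to a triage prompt.
build_triage_input() {
  local step_log="$1" pr_diff="$2"
  echo "=== failing step (last 50 lines) ==="
  tail -n 50 "$step_log"
  echo "=== PR diff ==="
  cat "$pr_diff"
}
```

Capping the log at 50 lines keeps the prompt small while still covering the failing assertion in a case like PR #872, where the relevant output is at the end of the step.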
Pipeline state discriminates the two infra-ish classes: error (no steps ran) points hard at ci-config; failure with sub-60s duration points at flake.

3. Act on the verdict.
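A minimal sketch of this step as a lookup table, together with the pipeline-state hint from step 2. The function names and the printed action strings are hypothetical labels for the moves described in this step, not real commands; the "unknown" fallback is an assumption for cases the state alone cannot decide.

```shell
#!/usr/bin/env bash
# Sketch only: state_hint and actions_for are hypothetical helpers.

# Pre-classify from pipeline state alone.
#   $1 = pipeline state, $2 = pipeline duration in seconds
state_hint() {
  local state="$1" duration_s="$2"
  if [ "$state" = "error" ]; then
    echo "ci-config"            # no steps ran
  elif [ "$state" = "failure" ] && [ "$duration_s" -lt 60 ]; then
    echo "flake"                # failed in under 60s
  else
    echo "unknown"              # leave the decision to the triage prompt
  fi
}

# Map a verdict to its ordered list of moves, one per line.
actions_for() {
  case "$1" in
    flake)
      printf '%s\n' restart-agent retrigger-pipeline ;;
    ci-config)
      printf '%s\n' close-pr-keep-branch append-prior-art return-to-pool ;;
    test-harness)
      printf '%s\n' close-pr-keep-branch append-prior-art unassign \
        remove-blocked return-to-pool ;;
    code)
      printf '%s\n' comment-on-issue unassign remove-blocked ;;
    *)
      echo "unknown verdict: $1" >&2
      return 1 ;;
  esac
}
```

Rejecting unrecognized verdicts with a non-zero status keeps a malformed triage answer from silently closing a PR.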
flake → restart disinto-woodpecker-agent + retrigger pipeline (#867 path).
ci-config → close the PR (keep branch), append the parse-error + line number + fix suggestion to the originating issue's body under a ## Prior art: PR #N section. PRs in this class never post commit statuses back to Forgejo, so they show a blank CI badge rather than a red X; the sweeper must probe Woodpecker directly for error-state pipelines, not rely on the Forgejo /commits/<sha>/status endpoint (which returns empty for this class). Return to pool.
test-harness → close the PR (keep branch), append the failing-step extract + the triage one-liner to the originating issue's body under a ## Prior art: PR #N section, unassign + remove blocked label, return to pool.
code → leave PR open, post a comment on the issue summarizing the failure for the next claimant, unassign + remove blocked so the next dev agent picks it up cleanly. (Don't auto-close: code-failure PRs are partial progress worth keeping reviewable.)

4. Idempotency guard.
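A minimal sketch of the guard, assuming the caller has already fetched the comment bodies for the PR or issue and filtered them to the supervisor's own account. already_swept is a hypothetical helper; only the marker string comes from this issue.

```shell
#!/usr/bin/env bash
# Sketch only: already_swept is a hypothetical helper.
SWEPT_MARKER='<!-- supervisor-swept -->'

# Reads comment bodies on stdin; returns 0 if any of them carries the marker.
already_swept() {
  grep -qF -- "$SWEPT_MARKER"
}
```

A sweeper loop would skip any PR for which already_swept succeeds, and append the marker to its own trailing comment after acting.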
Supervisor must mark touched PRs/issues so it doesn't sweep them twice. A trailing comment from supervisor-bot containing <!-- supervisor-swept --> is enough.

Why this matters
The migration backlog has the shape of "many small agents.* config issues, each one PR-sized." Every one of them carries non-zero risk of dev-qwen producing a structurally-broken test that exhausts budget legitimately. Without a sweeper, every such mis-step costs an operator 20–30min of manual triage. With a sweeper, the cost is one supervisor tick (~30s) and the issue is back in the pool with strictly more context than it had before.

This is the cheapest reachable improvement to throughput before cutover. It also generalizes: the same harvest-and-triage primitive is what an eventual auto-reviewer-bot would call.

Fix for this specific incident (already applied manually)
fix/issue-850 at db009e3).
## Prior art: PR #872 section.
blocked; leaving that for the supervisor sweeper once it exists, so we can validate the sweeper against a real pre-existing case.

Second validating case: PR #896 / issue #884 (2026-04-16 16:46–18:04)
Same signature, ci-config branch: dev-qwen opened PR #896 for issue #884 and rewrote .woodpecker/nomad-validate.yml with an inlined cat <<EOF heredoc whose body started at column 0 inside a - | block scalar. Pipelines #1037/#1038 errored with yaml: line 301: could not find expected ':' (pre-step, no Forgejo status posted, PR UI showed blank). dev-qwen exited ci_timeout; issue sat blocked until manual sweep. Resolution mirrored #872: close PR, retain branch fix/issue-884 at 108b928c, append diagnosis + cp lib/secret-scan.sh /tmp/ suggestion to #884 body, leave blocked + assigned for future sweeper validation. Validates the need for the ci-config verdict branch added above.

Non-goals
pr-lifecycle budget. Supervisor probes the outcome (issue stuck in blocked with an open PR), not the agent's internal state.

Labels / meta
vision: supervisor capability we don't have yet, distinct from any single bug.