incident: PR #872 sat abandoned 2h22m after legit CI fail — supervisor needs an unblock-PR sweeper (2026-04-16) #894

Open
opened 2026-04-16 16:32:36 +00:00 by dev-bot · 0 comments

Incident

2026-04-16 13:22–16:17 UTC. dev-qwen claimed issue #850 (compose generator dup-detection), opened PR #872 with a working lib/generators.sh change + a passing unit test (tests/test-duplicate-service-detection.sh) + a new section 8/8 in tests/smoke-init.sh. The smoke-test addition was structurally broken: it re-invokes bin/disinto init after sections 1–7 have already materialized docker-compose.yml, so _generate_compose_impl hits its early-return at lib/generators.sh:298 ("Compose: already exists, skipping") and never reaches the new dup-check. The grep for Duplicate service name 'agents-llama' in step-8 output finds nothing → FAIL.

Blast radius: dev-qwen marked the issue ci_exhausted at 13:55:33, seconds after the failing pipeline #1001 finished at 13:55:20, and moved on. Issue and PR sat stuck for 2h 22m until an operator (johba, via Claude) manually diagnosed the failure, harvested the diff, closed the PR with a prior-art reference, and returned the issue to the pool with a one-line fix suggestion (rm -f docker-compose.yml before the step-8 init). Nothing in the supervisor loop noticed.
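The failure mode can be reproduced in isolation. The sketch below is not the real `_generate_compose_impl`; `generate_compose` and its service-list argument are hypothetical stand-ins that copy only the early-return shape described above, so the effect of the suggested `rm -f` is visible.

```shell
#!/bin/sh
# Hypothetical stand-in for _generate_compose_impl (the real one lives in
# lib/generators.sh). Only the "already exists" early return and the idea of
# a duplicate-service check are taken from the incident text.
generate_compose() {
    compose_file=$1
    services=$2                       # space-separated names (assumed shape)
    if [ -f "$compose_file" ]; then
        echo "Compose: already exists, skipping"
        return 0                      # early return: dup-check never runs
    fi
    dup=$(printf '%s\n' $services | sort | uniq -d | head -n 1)
    if [ -n "$dup" ]; then
        echo "Duplicate service name '$dup'"
        return 1
    fi
    : > "$compose_file"
}

workdir=$(mktemp -d)
compose="$workdir/docker-compose.yml"

# Sections 1-7: the first init materializes the compose file.
generate_compose "$compose" "agents agents-llama" >/dev/null

# Section 8 as written in the PR: re-run with a duplicate, file still present.
broken_output=$(generate_compose "$compose" "agents-llama agents-llama") || true

# Section 8 with the suggested fix: remove the file before re-running.
rm -f "$compose"
fixed_output=$(generate_compose "$compose" "agents-llama agents-llama") || true

printf 'without rm: %s\n' "$broken_output"
printf 'with rm:    %s\n' "$fixed_output"
rm -rf "$workdir"
```

Without the `rm -f`, the grep target never appears in the output because the function returns before the check; with it, the duplicate message is emitted.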

Why this stayed stuck

This is not the #867 pattern (infra flake burning budget). The CI failure here was legitimate and reproducible — the test as written cannot pass against the code as written. dev-qwen's pr-lifecycle budget exhausted correctly; the agent self-reported ci_exhausted correctly. The gap is downstream: once an issue lands in blocked: ci_exhausted with an open PR, no supervisor probe attempts to unblock it. It just sits.

The manual recovery this time was four moves:

  1. Read the failing CI step's logs and identify the failing assertion.
  2. Cross-reference assertion against PR diff to decide test-broken vs. code-broken.
  3. Append diagnosis + suggested fix to the originating issue's body.
  4. Close the PR (state=closed), preserve the branch, return the issue to the pool.

All four are mechanical given the failing-step text. None require the model to re-attempt the work — only to triage and re-queue.

Suggestions — what supervisor can check every 20min

Supervisor already runs every 1200s. Add an abandoned-PR sweeper before any other probe:

1. Find abandoned PRs.

GET /api/v1/repos/disinto-admin/disinto/pulls?state=open

For each open PR, check the linked issue (parse Fixes #N / Closes #N from the body). A PR is a candidate for sweep when the issue carries the label blocked: ci_exhausted, the PR's latest pipeline status is failure, and the PR has no human-authored commits in the last 30min.
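A minimal sketch of the candidate filter, assuming the PR body, the issue's labels (as a comma-separated string), the latest pipeline status, and the minutes since the last human commit have already been fetched. The function names and argument shapes are hypothetical; only the three conditions come from the text above.

```shell
#!/bin/sh
# Print the first issue number referenced as "Fixes #N" or "Closes #N"
# (case-insensitive). Prints nothing if the body has no such reference.
linked_issue() {
    printf '%s\n' "$1" |
        grep -Eio '(fixes|closes) #[0-9]+' |
        head -n 1 |
        grep -Eo '[0-9]+'
}

# Succeed only when all three sweep conditions hold.
#   $1 labels on the linked issue, comma-separated (assumed input shape)
#   $2 latest pipeline status for the PR
#   $3 minutes since the last human-authored commit
is_sweep_candidate() {
    labels=$1
    pipeline_status=$2
    idle_minutes=$3
    case ",$labels," in
        *",blocked: ci_exhausted,"*) ;;
        *) return 1 ;;
    esac
    # Only "failure" is checked here, as in the text; error-state pipelines
    # (the ci-config class) would need the separate Woodpecker probe.
    [ "$pipeline_status" = "failure" ] || return 1
    [ "$idle_minutes" -ge 30 ]
}

issue=$(linked_issue "Add dup-detection to compose generator. Fixes #850")
if is_sweep_candidate "vision,blocked: ci_exhausted" failure 45; then
    echo "issue #$issue: candidate for sweep"
fi
```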

2. Harvest the failing step.

GET /api/repos/2/pipelines/<N>            # find failed workflow + step uuid
sqlite3 /var/lib/woodpecker/woodpecker.sqlite \
  "SELECT data FROM log_entries WHERE step_id=(SELECT id FROM steps WHERE uuid='<UUID>') ORDER BY line"

Extract the last 50 lines of the failing step. Feed them + the PR diff into a small triage prompt:

"Given this failing CI output and this PR diff, is the failure caused by (a) the production code under test, (b) the test harness itself (setup/teardown/order), (c) the Woodpecker config (YAML parse error, allow-list rejection, missing secret ref — pipeline errored before any step ran), or (d) infra flake? Reply with one of {code, test-harness, ci-config, flake} and a one-line reason."

Pipeline state discriminates the two infra-ish classes: error (no steps ran) points hard at ci-config; failure with sub-60s duration points at flake.
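The state-and-duration heuristic can run before the triage prompt as a cheap pre-check. This is a sketch only: `pipeline_hint` and `log_tail` are hypothetical helpers, and the `needs-triage` and `not-failed` outputs are placeholder labels that do not appear in the proposal.

```shell
#!/bin/sh
# Hint at a verdict from pipeline state and duration alone.
#   $1 pipeline state as reported by Woodpecker (error, failure, ...)
#   $2 pipeline duration in whole seconds
pipeline_hint() {
    state=$1
    duration_s=$2
    case "$state" in
        error)
            echo "ci-config" ;;          # no steps ran
        failure)
            if [ "$duration_s" -lt 60 ]; then
                echo "flake"             # sub-60s failure
            else
                echo "needs-triage"      # hand off to the triage prompt
            fi ;;
        *)
            echo "not-failed" ;;
    esac
}

# Keep only the last 50 lines of the failing step's log (read from stdin).
log_tail() {
    tail -n 50
}

pipeline_hint error 0
pipeline_hint failure 12
pipeline_hint failure 600
```

The hint is only a prior; a `failure` that runs longer than 60s still needs the prompt to separate `code` from `test-harness`.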

3. Act on the verdict.

  • flake → restart disinto-woodpecker-agent + retrigger pipeline (#867 path).
  • ci-config → close the PR (keep branch), append the parse-error + line number + fix suggestion to the originating issue's body under a ## Prior art: PR #N section, return to pool. PRs in this class never post commit statuses back to Forgejo, so they show a blank CI badge rather than a red X; the sweeper must probe Woodpecker directly for error-state pipelines rather than rely on the Forgejo /commits/<sha>/status endpoint (which returns empty for this class).
  • test-harness → close the PR (keep branch), append the failing-step extract + the triage one-liner to the originating issue's body under a ## Prior art: PR #N section, unassign + remove blocked label, return to pool.
  • code → leave PR open, post a comment on the issue summarizing the failure for the next claimant, unassign + remove blocked so the next dev agent picks it up cleanly. (Don't auto-close — code-failure PRs are partial progress worth keeping reviewable.)
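The four branches map naturally onto a `case` dispatch. The sketch below is a dry run that only prints the planned actions; `plan_actions` is a hypothetical name, and a real sweeper would replace each `echo` with the corresponding Forgejo or Woodpecker call.

```shell
#!/bin/sh
# Print the actions for a verdict, one per line (dry run, no API calls).
#   $1 verdict: flake | ci-config | test-harness | code
#   $2 PR number
#   $3 originating issue number
plan_actions() {
    verdict=$1
    pr=$2
    issue=$3
    case "$verdict" in
        flake)
            echo "restart disinto-woodpecker-agent"
            echo "retrigger pipeline for PR #$pr" ;;
        ci-config)
            echo "close PR #$pr (keep branch)"
            echo "append parse error + fix suggestion to issue #$issue under '## Prior art: PR #$pr'"
            echo "return issue #$issue to pool" ;;
        test-harness)
            echo "close PR #$pr (keep branch)"
            echo "append failing-step extract + triage one-liner to issue #$issue under '## Prior art: PR #$pr'"
            echo "unassign issue #$issue and remove blocked label"
            echo "return issue #$issue to pool" ;;
        code)
            echo "leave PR #$pr open"
            echo "comment on issue #$issue with failure summary"
            echo "unassign issue #$issue and remove blocked label" ;;
        *)
            echo "unknown verdict: $verdict" >&2
            return 1 ;;
    esac
}

plan_actions test-harness 872 850
```

Rejecting unknown verdicts with a non-zero status keeps a malformed triage reply from silently closing or reassigning anything.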

4. Idempotency guard.
Supervisor must mark touched PRs/issues so it doesn't sweep them twice. A trailing comment from supervisor-bot containing <!-- supervisor-swept --> is enough.
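The marker check is a fixed-string search over existing comment bodies. In this sketch `already_swept` and `sweep_comment` are hypothetical helpers, and the comment text is assumed to arrive on stdin; verifying that the comment's author is `supervisor-bot` is left out.

```shell
#!/bin/sh
SWEPT_MARKER='<!-- supervisor-swept -->'

# Succeed if any comment body on stdin already carries the marker.
already_swept() {
    grep -qF -- "$SWEPT_MARKER"
}

# Build a sweep comment: the given message followed by the hidden marker.
sweep_comment() {
    printf '%s\n\n%s\n' "$1" "$SWEPT_MARKER"
}

comments=$(sweep_comment "Swept: verdict test-harness, PR closed, branch kept.")
if printf '%s\n' "$comments" | already_swept; then
    echo "skip: already swept"
fi
```

Because the marker is an HTML comment, it stays invisible in the rendered issue while remaining searchable in the raw comment body.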

Why this matters

The migration backlog has the shape of "many small agents.* config issues, each one PR-sized." Every one of them carries non-zero risk of dev-qwen producing a structurally-broken test that exhausts budget legitimately. Without a sweeper, every such mis-step costs an operator 20–30min of manual triage. With a sweeper, the cost is one supervisor tick (~30s) and the issue is back in the pool with strictly more context than it had before.

This is the cheapest reachable improvement to throughput before cutover. It also generalizes: the same harvest-and-triage primitive is what an eventual auto-reviewer-bot would call.

Fix for this specific incident (already applied manually)

  • Closed PR #872 (kept branch fix/issue-850 at db009e3).
  • Appended root-cause + one-line fix suggestion to issue #850 body under a ## Prior art: PR #872 section.
  • Added closing comment on the PR pointing back to the issue.
  • Issue is now back in pool but not unassigned from dev-qwen and not un-blocked — leaving that for the supervisor sweeper once it exists, so we can validate the sweeper against a real pre-existing case.

Second validating case: PR #896 / issue #884 (2026-04-16 16:46–18:04)

Same signature, ci-config branch: dev-qwen opened PR #896 for issue #884, rewrote .woodpecker/nomad-validate.yml with an inlined cat <<EOF heredoc whose body started at column 0 inside a - | block scalar. Pipelines #1037/#1038 errored with yaml: line 301: could not find expected ':' — pre-step, no Forgejo status posted, PR UI showed blank. dev-qwen exited ci_timeout; issue sat blocked until manual sweep. Resolution mirrored #872: close PR, retain branch fix/issue-884 at 108b928c, append diagnosis + cp lib/secret-scan.sh /tmp/ suggestion to #884 body, leave blocked + assigned for future sweeper validation. Validates the need for the ci-config verdict branch added above.

Non-goals

  • Not trying to make dev-qwen write better smoke tests. The class of bug ("test harness depends on prior-section pre-state") is hard to predict without execution; cheaper to catch downstream.
  • Not building a generic CI-doctor. This is specifically unblock abandoned PRs, not diagnose every red build.
  • Not coupling the sweeper to the pr-lifecycle budget. Supervisor probes the outcome (issue stuck in blocked with an open PR), not the agent's internal state.

Labels / meta

  • vision — supervisor capability we don't have yet, distinct from any single bug.
  • Cross-references: #850 (the underlying issue, now waiting in pool), #867 (sibling incident; same supervisor-tick layer, different verdict branch).
dev-bot added the vision label 2026-04-16 16:32:36 +00:00
Reference: disinto-admin/disinto#894