bug: agent CI-failure prompt impoverished — per-workflow/per-step diagnostics missing, exit codes unannotated, flakes mixed with real failures #1050

New issue

Open

opened 2026-04-19 17:37:44 +00:00 by disinto-admin · 0 comments

disinto-admin commented

2026-04-19 17:37:44 +00:00

Owner

Problem

When CI fails on an agent-authored PR, the agent's self-correction loop receives an impoverished view of the failure and cannot diagnose problems across multiple independent workflows. Agents burn their 3-attempt fix budget retrying the wrong fix or "fixing" problems they didn't cause.

Observed on PR #1046 (issue #1025): three workflows failed for three entirely different reasons (duplicate-detection exit 1, caddy-validate exit 127 "curl: not found", smoke-init exit 126 with pre-existing branch-index flakiness). Agent consumed all 3 inner attempts without fixing any of them, because the prompt it received could not surface per-workflow context.

Root cause

lib/pr-lifecycle.sh:431-457 builds the CI-fix prompt from two sources:

_PR_CI_ERROR_LOG — a single "error log snippet" set by pr_poll_ci
ci_get_logs "$_PR_CI_PIPELINE" | tail -50 — the last 50 lines of one combined log stream per pipeline

With 3 concurrent workflows each producing their own step logs, tail -50 of a combined stream shows whatever finished last — not the actual failure points. Even when per-workflow logs are queryable (the Woodpecker API supports /api/repos/{id}/logs/{pipeline}/{step_id} per step), the helper collapses them into one tail.

The agent also does not receive:

Which workflow/step failed (name, exit code)
Which workflows passed (so it doesn't try to fix passing ones)
Exit-code hints — 126 ≠ 127 ≠ 1 diagnostically; agents parse these as "CI failed" without distinction
Known-flake markers — pre-existing flaky workflows (e.g. smoke-init mock-Forgejo branch-index latency) get mixed in with real failures

Desired behavior

The CI-fix prompt given to Claude on attempt N/3 should contain, for each failed workflow:

Workflow: smoke-init (state: failure)
  └─ Step: smoke-init (exit 126 = "permission denied or not executable")
     Log (last 50 lines of this step):
     ```
     ... just this step's tail, not the pipeline combined stream ...
     ```

Workflow: edge-subpath (state: failure)
  └─ Step: caddy-validate (exit 127 = "command not found")
     Log (last 50 lines of this step):
     ```
     ... ...
     ```

Passing workflows (do NOT modify): ci (all steps pass)

Exit codes 126/127/128 should be annotated with their standard meanings. Flaky-workflow list (if maintained) should be excluded from the "required fix" set.

Fix sketch

In lib/pr-lifecycle.sh CI-failure block (line ~431-459):

For each workflow.state == "failure" in the pipeline:
- Fetch /api/repos/{id}/logs/{pipeline_num}/{step_id} for the specific failed step (not the whole workflow)
- Tail 50 lines from just that step
- Include step name, workflow name, exit code, and annotated exit-code meaning
Include a "passing workflows" line so the agent doesn't waste context trying to fix passing steps.
Optional: support a .disinto/ci-flakes.yml allowlist listing workflow:step pairs that should be treated as informational (agent warned, but not asked to fix).

Acceptance criteria

On a PR with 3 failing workflows, the CI-fix prompt contains a per-workflow section with step name, exit code, exit-code-meaning annotation, and step-local tail
On a PR with 1 failing + 2 passing workflows, the prompt lists passing workflows explicitly under "do not modify"
ci_get_logs helper in lib/ci-helpers.sh grows a --step-id variant that fetches per-step logs
Exit codes 126, 127, 128 are surfaced with standard meanings in the prompt
A regression test against PR #1046's pipeline #1423 content produces a prompt that cleanly separates the three failures (dup-detection / caddy-validate / smoke-init)

#1044 — woodpecker-agent unhealthy + step-log truncation on short-duration failures. Solves the "logs don't exist server-side" case; this issue solves the "logs exist but agent doesn't see them usefully" case. Both need to land for the agent to reliably self-correct.
#1047 — blocked label not cleared on re-claim; independent, but compounds this issue by hiding the diagnostic pass from dev-poll until after wasted attempts.

Blocks

#1025 — dev-qwen2 / dev-qwen have failed 5+ PR attempts on #1025 largely due to this diagnostic blindness. #1025 should be reassigned after this issue lands.

## Problem When CI fails on an agent-authored PR, the agent's self-correction loop receives an impoverished view of the failure and cannot diagnose problems across multiple independent workflows. Agents burn their 3-attempt fix budget retrying the wrong fix or "fixing" problems they didn't cause. Observed on PR #1046 (issue #1025): three workflows failed for three entirely different reasons (`duplicate-detection` exit 1, `caddy-validate` exit 127 "curl: not found", `smoke-init` exit 126 with pre-existing branch-index flakiness). Agent consumed all 3 inner attempts without fixing any of them, because the prompt it received could not surface per-workflow context. ## Root cause `lib/pr-lifecycle.sh:431-457` builds the CI-fix prompt from two sources: 1. `_PR_CI_ERROR_LOG` — a single "error log snippet" set by `pr_poll_ci` 2. `ci_get_logs "$_PR_CI_PIPELINE" | tail -50` — the last 50 lines of **one** combined log stream per pipeline With 3 concurrent workflows each producing their own step logs, `tail -50` of a combined stream shows whatever finished last — not the actual failure points. Even when per-workflow logs are queryable (the Woodpecker API supports `/api/repos/{id}/logs/{pipeline}/{step_id}` per step), the helper collapses them into one tail. The agent also does not receive: - **Which workflow/step failed** (name, exit code) - **Which workflows passed** (so it doesn't try to fix passing ones) - **Exit-code hints** — 126 ≠ 127 ≠ 1 diagnostically; agents parse these as "CI failed" without distinction - **Known-flake markers** — pre-existing flaky workflows (e.g. `smoke-init` mock-Forgejo branch-index latency) get mixed in with real failures ## Desired behavior The CI-fix prompt given to Claude on attempt N/3 should contain, for each **failed** workflow: ``` Workflow: smoke-init (state: failure) └─ Step: smoke-init (exit 126 = "permission denied or not executable") Log (last 50 lines of this step): ``` ... just this step's tail, not the pipeline combined stream ... ``` Workflow: edge-subpath (state: failure) └─ Step: caddy-validate (exit 127 = "command not found") Log (last 50 lines of this step): ``` ... ... ``` Passing workflows (do NOT modify): ci (all steps pass) ``` Exit codes 126/127/128 should be annotated with their standard meanings. Flaky-workflow list (if maintained) should be excluded from the "required fix" set. ## Fix sketch In `lib/pr-lifecycle.sh` CI-failure block (line ~431-459): 1. For each `workflow.state == "failure"` in the pipeline: - Fetch `/api/repos/{id}/logs/{pipeline_num}/{step_id}` for the specific failed step (not the whole workflow) - Tail 50 lines from just that step - Include step name, workflow name, exit code, and annotated exit-code meaning 2. Include a "passing workflows" line so the agent doesn't waste context trying to fix passing steps. 3. Optional: support a `.disinto/ci-flakes.yml` allowlist listing `workflow:step` pairs that should be treated as informational (agent warned, but not asked to fix). ## Acceptance criteria - [ ] On a PR with 3 failing workflows, the CI-fix prompt contains a per-workflow section with step name, exit code, exit-code-meaning annotation, and step-local tail - [ ] On a PR with 1 failing + 2 passing workflows, the prompt lists passing workflows explicitly under "do not modify" - [ ] `ci_get_logs` helper in `lib/ci-helpers.sh` grows a `--step-id` variant that fetches per-step logs - [ ] Exit codes 126, 127, 128 are surfaced with standard meanings in the prompt - [ ] A regression test against PR #1046's pipeline #1423 content produces a prompt that cleanly separates the three failures (dup-detection / caddy-validate / smoke-init) ## Related - #1044 — woodpecker-agent unhealthy + step-log truncation on short-duration failures. Solves the "logs don't exist server-side" case; this issue solves the "logs exist but agent doesn't see them usefully" case. Both need to land for the agent to reliably self-correct. - #1047 — `blocked` label not cleared on re-claim; independent, but compounds this issue by hiding the diagnostic pass from dev-poll until after wasted attempts. ## Blocks - #1025 — dev-qwen2 / dev-qwen have failed 5+ PR attempts on #1025 largely due to this diagnostic blindness. #1025 should be reassigned after this issue lands.