feat: per-workflow/per-step CI diagnostics in agent fix prompts (implements #1050) #1051

Closed
opened 2026-04-19 18:30:23 +00:00 by disinto-admin · 1 comment

Goal

Implement per-workflow/per-step CI-failure diagnostics in the agent's CI-fix prompt builder, so agents can diagnose and fix multi-workflow failures instead of burning their fix budget on the wrong thing.

Fixes the gap documented in bug report #1050. Full diagnosis, root-cause analysis, and fix sketch live there.

The change (summary — see #1050 for evidence and rationale)

In the CI-failure block of lib/pr-lifecycle.sh (~lines 431-459):

  1. Walk pipeline.workflows[] — for each workflow with state == "failure", collect its failed children[] (the actual failing step).
  2. For each failed step, fetch GET /api/repos/{id}/logs/{pipeline_num}/{step_id} via woodpecker_api and tail 50 lines from just that step (not the pipeline-combined stream).
  3. Build the agent prompt with one section per failed workflow containing: workflow name, step name, exit code, annotated exit-code meaning (126/127/128 standard meanings), and the per-step tail.
  4. Emit a "Passing workflows (do not modify): <list>" line so the agent doesn't waste context on healthy workflows.
  5. Add a ci_get_step_logs <pipeline_num> <step_id> helper in lib/ci-helpers.sh that performs a step-scoped log fetch (mirroring the existing ci_get_logs pattern).
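
As a rough illustration of steps 2 to 4, the sketch below formats one prompt section from a step's log and builds the passing-workflows line. The function names, argument order, and section layout are hypothetical and not taken from lib/pr-lifecycle.sh. It assumes the failed workflow and step fields have already been extracted from the pipeline JSON, and that the step log arrives on stdin as plain text.

```shell
# Sketch only: function names and section layout are assumptions.

# Reads one step's log on stdin and prints a prompt section for it.
# Args: workflow name, step name, exit code, optional exit-code meaning.
build_failure_section() {
  local workflow="$1" step="$2" exit_code="$3" meaning="${4:-}"
  printf 'Failed workflow: %s\n' "$workflow"
  printf 'Step: %s\n' "$step"
  if [ -n "$meaning" ]; then
    printf 'Exit code: %s (%s)\n' "$exit_code" "$meaning"
  else
    printf 'Exit code: %s\n' "$exit_code"
  fi
  printf 'Last 50 lines of this step:\n'
  # Keep only the tail of this step's log, not the pipeline-combined stream.
  tail -n 50
  printf '\n'
}

# Prints the passing-workflows line, or nothing when no workflow passed.
# Args: names of the passing workflows.
passing_workflows_line() {
  [ "$#" -gt 0 ] || return 0
  local joined
  joined=$(printf '%s, ' "$@")
  printf 'Passing workflows (do not modify): %s\n' "${joined%, }"
}
```

A caller would pipe each failed step's log into build_failure_section once per failed workflow, then append the output of passing_workflows_line with the names of the healthy workflows.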

Optional follow-up (separate issue if substantial): a .disinto/ci-flakes.yml allowlist for known-flaky workflow:step pairs so the agent skips them in the "must fix" section. Current known flake: smoke-init:smoke-init (the mock-Forgejo branch-index retry is exhausted on PR builds; see evidence in #1050).
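
The allowlist format is not specified in this issue. One possible shape, purely as an assumption, is a flat YAML list of unquoted workflow:step entries. Under that assumption, a lookup could be sketched as follows (is_known_flake is a hypothetical name):

```shell
# Sketch only. Assumed file layout (not defined in the issue):
#   flakes:
#     - smoke-init:smoke-init
#
# Returns 0 if the given workflow:step pair is listed, 1 otherwise.
# Args: "workflow:step" pair, optional path to the allowlist file.
is_known_flake() {
  local pair="$1" file="${2:-.disinto/ci-flakes.yml}"
  [ -f "$file" ] || return 1
  # Keep only list items, strip the "- " marker and trailing whitespace,
  # then require an exact whole-line match on the pair.
  sed -n 's/^[[:space:]]*-[[:space:]]*//p' "$file" \
    | sed 's/[[:space:]]*$//' \
    | grep -qxF -- "$pair"
}
```

Quoted entries or nested YAML structures would need a real YAML parser; this sketch deliberately handles only the flat form described above.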

Acceptance criteria

  • On a PR with 3 failing workflows (reproduce with fixture from PR #1046's pipeline #1423 if still queryable; otherwise synthesize), the generated CI-fix prompt contains three distinct sections — one per failed workflow — each with workflow name, step name, exit code, annotated meaning, and step-local log tail.
  • Prompt includes a "Passing workflows (do not modify): X, Y" line when at least one workflow in the pipeline passed.
  • Exit codes 126 ("permission denied or not executable"), 127 ("command not found"), 128 ("invalid exit argument / signal+128") are annotated inline.
  • ci_get_step_logs <pipeline_num> <step_id> helper added to lib/ci-helpers.sh with doc comment matching the style of ci_get_logs.
  • No regression on single-workflow failures — prompt output for a single failed workflow is at least as informative as today's.
  • shellcheck clean.
  • Existing dev-agent loop still works end-to-end — spawn a test scenario (can be a fixture PR with a deliberately failing step) and confirm the agent gets the new prompt.
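
The exit-code annotation in the third criterion could be as small as a case statement. The sketch below uses the three meanings quoted above; the function name annotate_exit_code is hypothetical, and printing nothing for other codes is an assumption so that callers can omit the annotation.

```shell
# Sketch only: maps the exit codes named in the acceptance criteria
# to their meanings. Prints nothing for any other code.
# Args: numeric exit code.
annotate_exit_code() {
  case "$1" in
    126) printf '%s\n' 'permission denied or not executable' ;;
    127) printf '%s\n' 'command not found' ;;
    128) printf '%s\n' 'invalid exit argument / signal+128' ;;
    *)   : ;;  # no standard annotation
  esac
}
```

For example, annotate_exit_code 127 prints "command not found", which can be placed inline next to the exit code in the prompt section.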

Affected files

  • lib/pr-lifecycle.sh — CI-failure prompt builder (~lines 431-459)
  • lib/ci-helpers.sh — add ci_get_step_logs helper
  • Possibly: lib/ci-helpers.sh:ci_commit_status — no change expected; mentioned so the agent doesn't trip over the error→failure mapping and think it needs changing

Related

  • #1050 — bug report with full diagnosis and motivating evidence
  • #1044 — server-side step-log truncation (complementary; this issue assumes logs exist, #1044 ensures they exist)
  • #1025 — concrete blocked issue waiting on this fix
disinto-admin added the backlog label 2026-04-19 18:30:23 +00:00
dev-bot self-assigned this 2026-04-19 18:30:33 +00:00
dev-bot added in-progress and removed backlog labels 2026-04-19 18:30:33 +00:00
Collaborator

Blocked — issue #1051

| Field | Value |
|---|---|
| Exit reason | `ci_exhausted_poll (3 attempts, PR #1052)` |
| Timestamp | `2026-04-19T18:39:49Z` |
dev-qwen2 added blocked and removed in-progress labels 2026-04-19 18:39:49 +00:00
dev-bot removed their assignment 2026-04-19 19:11:54 +00:00
Reference: disinto-admin/disinto#1051