bug: disinto-woodpecker-agent unhealthy; step logs truncated on short-duration failures #1044
Reference: disinto-admin/disinto#1044
Problem
`disinto-woodpecker-agent` on `disinto-dev-box` is in an `unhealthy` state and exhibits two symptoms that pollute CI diagnostics:
1. `queue: task canceled` and `wait(): code: Unknown` messages in the agent log, correlating with pipeline workflows being canceled mid-run.
2. On short-duration failures (`alpine:3.19` steps dying in 2-3s), the captured log stops at the last pre-step trace line and the step's actual stdout/stderr is missing entirely. This makes debugging failing steps impossible.
Observed evidence
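To put numbers on the first symptom, the two error signatures quoted above can be counted in an exported agent log. The following is a minimal sketch, not part of any existing tooling: `count_signatures` and `SIGNATURES` are hypothetical names, and it assumes the signatures appear verbatim as substrings of log lines.

```python
from collections import Counter

# Error signatures quoted in this issue; matched as plain substrings.
SIGNATURES = ("queue: task canceled", "wait(): code: Unknown")


def count_signatures(log_text: str) -> Counter:
    """Count how many log lines contain each known error signature."""
    counts = Counter({sig: 0 for sig in SIGNATURES})
    for line in log_text.splitlines():
        for sig in SIGNATURES:
            if sig in line:
                counts[sig] += 1
    return counts
```

The input could be produced with something like `docker logs --since 1h disinto-woodpecker-agent 2>&1`, which would also cover the 1-hour window used in the acceptance criteria. An all-zero result over that window would satisfy the "no error entries" criterion.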
Symptom 1: grpc errors
Docker reports `disinto-woodpecker-agent Up 2 days (unhealthy)`.
Symptom 2: log truncation
Reproduced on pipelines #1378, #1380, #1382 (all PR #1033 attempts). Step durations: 2-3s each. Full captured log for each step:
The same script in the same image runs locally and produces ~30 lines of output in ~30 seconds:
So: the bash process is executing something (it exits with a specific code), but its stdout/stderr is not making it into the Woodpecker log.
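One way to make "output is missing" checkable is to compare a locally captured reference run against the log text exported from Woodpecker. The sketch below uses the minimal failing command listed in the acceptance criteria of this issue. `run_reference_step` and `missing_lines` are hypothetical helper names, and the comparison assumes the exported log has no per-line prefixes such as timestamps.

```python
import subprocess

# Minimal failing step quoted in the acceptance criteria of this issue.
REFERENCE_CMD = ["bash", "-c", "echo hello; echo err >&2; exit 1"]


def run_reference_step() -> tuple[str, int]:
    """Run the reference command locally, merging stderr into stdout."""
    proc = subprocess.run(
        REFERENCE_CMD,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        check=False,  # a non-zero exit is the expected outcome here
    )
    return proc.stdout, proc.returncode


def missing_lines(expected: str, captured: str) -> list[str]:
    """Return non-empty lines of `expected` that are absent from `captured`.

    Lines are compared exactly (after stripping whitespace), so a shell
    trace line such as "+ echo hello" does not count as the output "hello".
    """
    captured_lines = {line.strip() for line in captured.splitlines()}
    return [
        line.strip()
        for line in expected.splitlines()
        if line.strip() and line.strip() not in captured_lines
    ]
```

An empty list from `missing_lines(reference_output, woodpecker_log)` would indicate the step output reached the log intact. A truncated log like the ones described above would return the lines that were lost.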
Likely root cause
The unhealthy status and the truncation are probably the same bug: the agent's log-streaming goroutine disconnects from the server (grpc error), and output buffered after the disconnect is lost when the step container is torn down. Earlier pipelines explicitly show `queue: task canceled`; newer ones (e.g. #1382) no longer log "canceled" but exhibit the same truncation, suggesting the streaming channel is intermittently broken.
Hypotheses to investigate
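The suspected failure mode can be illustrated with a toy model. This is not Woodpecker's actual code: `ToyLogStream` and `simulate` are hypothetical, and the model only assumes that lines written after a disconnect stay in a local buffer that is discarded on teardown unless it is drained first.

```python
class ToyLogStream:
    """Toy model: lines are buffered locally and forwarded while connected."""

    def __init__(self) -> None:
        self.connected = True
        self.buffer: list[str] = []
        self.delivered: list[str] = []  # what the "server" ends up with

    def write(self, line: str) -> None:
        self.buffer.append(line)
        if self.connected:
            self.flush()

    def flush(self) -> None:
        self.delivered.extend(self.buffer)
        self.buffer.clear()

    def teardown(self, drain: bool) -> None:
        """Tear the step down; optionally reconnect and drain first."""
        if drain:
            self.connected = True
            self.flush()
        self.buffer.clear()  # anything still buffered is lost


def simulate(drain: bool) -> list[str]:
    stream = ToyLogStream()
    stream.write("+ trace line")  # delivered before the disconnect
    stream.connected = False      # stream drops (the hypothesized grpc error)
    stream.write("hello")         # buffered, never forwarded
    stream.write("err")
    stream.teardown(drain=drain)
    return stream.delivered
```

In this model `simulate(drain=False)` returns only `["+ trace line"]`, which matches the observed shape of the truncated logs, while `simulate(drain=True)` returns all three lines. If the real agent behaves similarly, a fix would need to ensure buffered output is flushed before the step is reported as finished.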
- If Docker logging uses the `json-file` driver while the server expects streaming, short-lived containers may flush to disk after the agent has already reported the step.
Acceptance criteria
- `disinto-woodpecker-agent` container status is `healthy` and stays healthy under load (spin up a known-failing pipeline 3× in a row; all three logs must contain the full script output, not just the shebang trace).
- No `queue: task canceled` or `wait(): code: Unknown` entries in the agent log for a 1-hour idle period.
- A short-lived failing step (`bash -c 'echo hello; echo err >&2; exit 1'`) produces a log containing both `hello` and `err` and exit code 1 in the Woodpecker UI.
Affected files / surface
- `docker-compose.yml`: woodpecker / woodpecker-agent service env vars
Related
Planner run 5 (2026-04-19): Added backlog. CI reliability issue — woodpecker-agent unhealthy state causes log truncation on failing steps, making debugging impossible. Affects all pipeline issues.
Blocked — issue #1044
no_push 2026-04-19T19:02:46Z
Diagnostic output