bug: disinto-woodpecker-agent unhealthy; step logs truncated on short-duration failures

disinto-admin commented

2026-04-19 15:37:00 +00:00

Owner

Problem

disinto-woodpecker-agent on disinto-dev-box is in unhealthy state and exhibits two symptoms that pollute CI diagnostics:

1. grpc errors between agent and server. The agent log shows intermittent "queue: task canceled" and "wait(): code: Unknown" messages, correlating with pipeline workflows being canceled mid-run.
2. Log truncation on abnormal step termination. When a step exits non-zero quickly (observed: alpine:3.19 steps dying in 2-3s), the captured log stops at the last pre-step trace line and the step's actual stdout/stderr is missing entirely. This makes failing steps impossible to debug from the CI log.

Observed evidence

Symptom 1 — grpc errors

$ docker logs disinto-woodpecker-agent 2>&1 | tail -20
... repeated ...
{"level":"error","error":"rpc error: code = Unknown desc = workflow finished with error uuid=... exit code 1","message":"grpc error: wait(): code: Unknown"}
{"level":"warn","repo":"disinto-admin/disinto","pipeline":"1347",...,"message":"cancel signal received"}
{"level":"error","error":"rpc error: code = Unknown desc = queue: task not found","message":"extending pipeline deadline failed"}
{"level":"error","error":"rpc error: code = Unknown desc = queue: task canceled","message":"grpc error: wait(): code: Unknown"}

Docker reports disinto-woodpecker-agent Up 2 days (unhealthy).
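
To see how often each error occurs rather than eyeballing the tail, the agent log can be tallied by error description. This is a minimal sketch, assuming the log is one JSON object per line as in the excerpt above. The function name tally_agent_errors is hypothetical, and it uses plain grep/sed instead of a JSON parser, so it will miss lines whose error text contains escaped quotes.

```shell
# Hypothetical helper: read agent log lines on stdin and print a count per
# distinct "error" description, most frequent first.
tally_agent_errors() {
  grep '"level":"error"' \
    | sed -n 's/.*"error":"\([^"]*\)".*/\1/p' \
    | sed 's/ uuid=[^ ]*//' \
    | sort | uniq -c | sort -rn
}

# Usage against the live container (needs access to the Docker daemon):
#   docker logs disinto-woodpecker-agent 2>&1 | tally_agent_errors
```

The second sed drops the per-workflow uuid so that repeated "workflow finished with error" entries group together instead of each counting once.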

Symptom 2 — log truncation

Reproduced on pipelines #1378, #1380, #1382 (all PR #1033 attempts). Step durations: 2-3s each. Full captured log for each step:

+ apk add --no-cache bash curl jq
... (30 alpine packages install cleanly) ...
OK: 15 MiB in 30 packages
+ bash tests/smoke-edge-subpath.sh
<nothing more — step exits 1, no bash output>

The same script in the same image runs locally and produces ~30 lines of output in ~30 seconds:

$ docker run --rm -v $PWD/tests/smoke-edge-subpath.sh:/w/t.sh:ro -w /w alpine:3.19 \
    sh -xc 'apk add --no-cache bash curl jq >/dev/null && bash /w/t.sh'
+ apk add --no-cache bash curl jq
+ bash /w/t.sh

=== Edge Subpath Routing Smoke Test ===
[INFO] Base URL: http://localhost
[INFO] Timeout: 30s, Max retries: 3
=== Test 1: Root redirects to /forge/ ===
[FAIL] Root redirect: curl failed
... (full output, all 8 tests run) ...
=== TEST FAILED ===
EXIT=1

So: the bash process is executing something (it exits with a specific code), but its stdout/stderr is not making it into the Woodpecker log.

Likely root cause

The unhealthy status and the truncation are probably the same bug: the agent's log-streaming goroutine disconnects from the server (grpc error), and output buffered after the disconnect is lost when the step container is torn down. Earlier pipelines explicitly show queue: task canceled — newer ones (e.g. #1382) no longer log "canceled" but exhibit the same truncation, suggesting the streaming channel is intermittently broken.

Hypotheses to investigate

1. Agent-server keepalive/deadline mismatch: WOODPECKER_AGENT_* keepalive config vs WOODPECKER_GRPC_* server-side deadlines.
2. Resource pressure on the agent: step containers may be OOM-killed, or hit disk-full on the agent's scratch volume, before output flushes.
3. Woodpecker 3.13.0 regression: check upstream issues for "log truncation" / "grpc wait" in 3.13.x.
4. Docker-in-docker socket/logging driver mismatch: if the agent uses the json-file driver while the server expects streaming, short-lived containers may flush to disk after the agent has already reported the step.
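
A first pass at the disk-pressure and logging-driver hypotheses can be made with stock Docker commands. The following is a sketch only: the function name agent_diagnostics is hypothetical, and it assumes the agent's scratch space lives under the Docker root directory, which may not hold for this deployment.

```shell
# Hypothetical helper: print facts relevant to the disk-pressure and
# logging-driver hypotheses. Needs access to the Docker daemon.
agent_diagnostics() {
  container=${1:-disinto-woodpecker-agent}
  # Assumption: scratch data lives under the Docker root dir.
  root_dir=$(docker info --format '{{.DockerRootDir}}')

  # .State.Health only exists for containers that define a healthcheck.
  echo "health: $(docker inspect --format '{{.State.Health.Status}}' "$container")"
  # Driver used by the agent container itself.
  echo "agent_log_driver: $(docker inspect --format '{{.HostConfig.LogConfig.Type}}' "$container")"
  # Daemon default, which containers get unless they override it.
  echo "daemon_log_driver: $(docker info --format '{{.LoggingDriver}}')"
  echo "docker_root: $root_dir"
  df -h "$root_dir"
}

# Usage:
#   agent_diagnostics                      # defaults to disinto-woodpecker-agent
#   agent_diagnostics some-other-container
```

This only reports current state. It does not show whether a step container that has already been torn down was OOM-killed.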

Acceptance criteria

disinto-woodpecker-agent container status is healthy and stays healthy under load (run a known-failing pipeline 3× in a row; all three logs must contain the full script output, not just the shell trace lines prefixed with +).
No "queue: task canceled" or "wait(): code: Unknown" entries in the agent log for a 1-hour idle period.
A deliberately crashing step (e.g. bash -c 'echo hello; echo err >&2; exit 1') produces a log containing both hello and err, and shows exit code 1 in the Woodpecker UI.
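
The last criterion can be checked mechanically once a step log has been saved to a file. This is a minimal sketch: the function name check_step_log is hypothetical, and it assumes the export is plain text with one output line per line. The local run below uses sh instead of bash for portability and only confirms that the command itself emits both streams; it does not exercise Woodpecker.

```shell
# Hypothetical checker: succeeds only if the saved log has "hello" and "err"
# as whole lines. -x matters: the "+ ... -c 'echo hello; echo err ...'" trace
# line contains both words and would otherwise pass a truncated log.
check_step_log() {
  grep -qx 'hello' "$1" && grep -qx 'err' "$1"
}

# Local sanity check of the crashing command, capturing stdout and stderr.
log=$(mktemp)
status=0
sh -c 'echo hello; echo err >&2; exit 1' > "$log" 2>&1 || status=$?
echo "exit=$status"
check_step_log "$log" && echo "log complete"
```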

Affected files / surface

docker-compose.yml — woodpecker / woodpecker-agent service env vars
agent log config / healthcheck definition
possibly: Woodpecker server/agent version pin

Related

#1025 / PR #1033: revealed the log truncation while trying to debug the edge-subpath smoke test
(agent-log errors predate #1033 — see 2026-04-18T12:33:24Z entries in agent log)

disinto-admin added the

bug-report

label 2026-04-19 15:37:00 +00:00

disinto-admin referenced this issue

2026-04-19 15:37:33 +00:00

vision(#623): end-to-end subpath routing smoke test for Forgejo + Woodpecker + chat #1025

planner-bot added the

backlog

label 2026-04-19 17:02:26 +00:00

planner-bot commented

2026-04-19 17:02:27 +00:00

Collaborator

Planner run 5 (2026-04-19): Added backlog. CI reliability issue — woodpecker-agent unhealthy state causes log truncation on failing steps, making debugging impossible. Affects all pipeline issues.

dev-qwen self-assigned this 2026-04-19 17:02:40 +00:00

dev-qwen added

in-progress

and removed

backlog

labels 2026-04-19 17:02:41 +00:00

gardener-bot referenced this issue

2026-04-19 17:03:14 +00:00

chore: gardener housekeeping 2026-04-19 #1048

planner-bot referenced this issue

2026-04-19 17:03:46 +00:00

bug: ops repo branch protection blocks all agent writes — prerequisites.md stuck since run 2 #758

review-bot referenced this issue

2026-04-19 17:13:38 +00:00

chore: gardener housekeeping 2026-04-19 #1048

gardener-bot added the

backlog

label 2026-04-19 17:15:58 +00:00

disinto-admin referenced this issue

2026-04-19 17:37:44 +00:00

bug: agent CI-failure prompt impoverished — per-workflow/per-step diagnostics missing, exit codes unannotated, flakes mixed with real failures #1050

disinto-admin referenced this issue

2026-04-19 18:30:23 +00:00

feat: per-workflow/per-step CI diagnostics in agent fix prompts (implements #1050) #1051

dev-qwen commented

2026-04-19 19:02:46 +00:00

Collaborator

Blocked — issue #1044

Field	Value
Exit reason	`no_push`
Timestamp	`2026-04-19T19:02:46Z`

Diagnostic output

Claude did not push branch fix/issue-1044

dev-qwen added

blocked

and removed

in-progress

labels 2026-04-19 19:02:46 +00:00

disinto-admin referenced this issue

2026-04-19 19:45:34 +00:00

bug: claude_run_with_watchdog leaks orphan bash children — review-pr.sh lock stuck for 47 min when Claude Bash-tool command hangs #1055

dev-qwen added

in-progress

and removed

backlog