bug: disinto-woodpecker-agent unhealthy; step logs truncated on short-duration failures #1044

Closed
opened 2026-04-19 15:37:00 +00:00 by disinto-admin · 2 comments

Problem

disinto-woodpecker-agent on disinto-dev-box is in unhealthy state and exhibits two symptoms that pollute CI diagnostics:

  1. grpc errors between agent and server — intermittent queue: task canceled and wait(): code: Unknown messages in the agent log, correlating with pipeline workflows being canceled mid-run.
  2. Log truncation on abnormal step termination — when a step exits non-zero fast (observed: alpine:3.19 steps dying in 2-3s), the captured log stops at the last pre-step trace line and the step's actual stdout/stderr is missing entirely. This makes debugging failing steps impossible.

Observed evidence

Symptom 1 — grpc errors

$ docker logs disinto-woodpecker-agent 2>&1 | tail -20
... repeated ...
{"level":"error","error":"rpc error: code = Unknown desc = workflow finished with error uuid=... exit code 1","message":"grpc error: wait(): code: Unknown"}
{"level":"warn","repo":"disinto-admin/disinto","pipeline":"1347",...,"message":"cancel signal received"}
{"level":"error","error":"rpc error: code = Unknown desc = queue: task not found","message":"extending pipeline deadline failed"}
{"level":"error","error":"rpc error: code = Unknown desc = queue: task canceled","message":"grpc error: wait(): code: Unknown"}

Docker reports disinto-woodpecker-agent Up 2 days (unhealthy).

Symptom 2 — log truncation

Reproduced on pipelines #1378, #1380, #1382 (all PR #1033 attempts). Step durations: 2-3s each. Full captured log for each step:

+ apk add --no-cache bash curl jq
... (30 alpine packages install cleanly) ...
OK: 15 MiB in 30 packages
+ bash tests/smoke-edge-subpath.sh
<nothing more — step exits 1, no bash output>

The same script in the same image runs locally and produces ~30 lines of output in ~30 seconds:

$ docker run --rm -v $PWD/tests/smoke-edge-subpath.sh:/w/t.sh:ro -w /w alpine:3.19 \
    sh -xc 'apk add --no-cache bash curl jq >/dev/null && bash /w/t.sh'
+ apk add --no-cache bash curl jq
+ bash /w/t.sh

=== Edge Subpath Routing Smoke Test ===
[INFO] Base URL: http://localhost
[INFO] Timeout: 30s, Max retries: 3
=== Test 1: Root redirects to /forge/ ===
[FAIL] Root redirect: curl failed
... (full output, all 8 tests run) ...
=== TEST FAILED ===
EXIT=1

So: the bash process is executing something (it exits with a specific code), but its stdout/stderr is not making it into the Woodpecker log.

Likely root cause

The unhealthy status and the truncation are probably the same bug: the agent's log-streaming goroutine disconnects from the server (grpc error), and output buffered after the disconnect is lost when the step container is torn down. Earlier pipelines explicitly show queue: task canceled — newer ones (e.g. #1382) no longer log "canceled" but exhibit the same truncation, suggesting the streaming channel is intermittently broken.

Hypotheses to investigate

  1. Agent-server keepalive/deadline mismatch — WOODPECKER_AGENT_* keepalive config vs WOODPECKER_GRPC_* server-side deadlines.
  2. Disk pressure on the agent's scratch volume — step containers may be killed by OOM or disk-full before output flushes.
  3. Woodpecker 3.13.0 regression — check upstream issues for "log truncation" / "grpc wait" in 3.13.x.
  4. Docker-in-docker socket/logging driver mismatch — if the agent uses json-file driver while the server expects streaming, short-lived containers may flush to disk after the agent has already reported the step.

Acceptance criteria

  • disinto-woodpecker-agent container status is healthy and stays healthy under load (spin up a known-failing pipeline 3× in a row — all three logs must contain the full script output, not just the shebang trace).
  • No queue: task canceled or wait(): code: Unknown entries in the agent log for a 1-hour idle period.
  • A deliberately-crashing step (e.g. bash -c 'echo hello; echo err >&2; exit 1') produces a log containing both hello and err and exit code 1 in Woodpecker UI.

Affected files / surface

  • docker-compose.yml — woodpecker / woodpecker-agent service env vars
  • agent log config / healthcheck definition
  • possibly: Woodpecker server/agent version pin
  • #1025 / PR #1033 — revealed the log truncation while trying to debug the edge-subpath smoke test
  • (agent-log errors predate #1033 — see 2026-04-18T12:33:24Z entries in agent log)
## Problem `disinto-woodpecker-agent` on disinto-dev-box is in `unhealthy` state and exhibits two symptoms that pollute CI diagnostics: 1. **grpc errors between agent and server** — intermittent `queue: task canceled` and `wait(): code: Unknown` messages in the agent log, correlating with pipeline workflows being canceled mid-run. 2. **Log truncation on abnormal step termination** — when a step exits non-zero fast (observed: `alpine:3.19` steps dying in 2-3s), the captured log stops at the last pre-step trace line and the step's actual stdout/stderr is missing entirely. This makes debugging failing steps impossible. ## Observed evidence ### Symptom 1 — grpc errors ``` $ docker logs disinto-woodpecker-agent 2>&1 | tail -20 ... repeated ... {"level":"error","error":"rpc error: code = Unknown desc = workflow finished with error uuid=... exit code 1","message":"grpc error: wait(): code: Unknown"} {"level":"warn","repo":"disinto-admin/disinto","pipeline":"1347",...,"message":"cancel signal received"} {"level":"error","error":"rpc error: code = Unknown desc = queue: task not found","message":"extending pipeline deadline failed"} {"level":"error","error":"rpc error: code = Unknown desc = queue: task canceled","message":"grpc error: wait(): code: Unknown"} ``` Docker reports `disinto-woodpecker-agent Up 2 days (unhealthy)`. ### Symptom 2 — log truncation Reproduced on pipelines #1378, #1380, #1382 (all PR #1033 attempts). Step durations: 2-3s each. Full captured log for each step: ``` + apk add --no-cache bash curl jq ... (30 alpine packages install cleanly) ... OK: 15 MiB in 30 packages + bash tests/smoke-edge-subpath.sh <nothing more — step exits 1, no bash output> ``` The same script in the same image runs locally and produces ~30 lines of output in ~30 seconds: ``` $ docker run --rm -v $PWD/tests/smoke-edge-subpath.sh:/w/t.sh:ro -w /w alpine:3.19 \ sh -xc 'apk add --no-cache bash curl jq >/dev/null && bash /w/t.sh' + apk add --no-cache bash curl jq + bash /w/t.sh === Edge Subpath Routing Smoke Test === [INFO] Base URL: http://localhost [INFO] Timeout: 30s, Max retries: 3 === Test 1: Root redirects to /forge/ === [FAIL] Root redirect: curl failed ... (full output, all 8 tests run) ... === TEST FAILED === EXIT=1 ``` So: the bash process is executing *something* (it exits with a specific code), but its stdout/stderr is not making it into the Woodpecker log. ## Likely root cause The unhealthy status and the truncation are probably the same bug: the agent's log-streaming goroutine disconnects from the server (grpc error), and output buffered after the disconnect is lost when the step container is torn down. Earlier pipelines explicitly show `queue: task canceled` — newer ones (e.g. #1382) no longer log "canceled" but exhibit the same truncation, suggesting the streaming channel is intermittently broken. ## Hypotheses to investigate 1. **Agent-server keepalive/deadline mismatch** — WOODPECKER_AGENT_* keepalive config vs WOODPECKER_GRPC_* server-side deadlines. 2. **Disk pressure on the agent's scratch volume** — step containers may be killed by OOM or disk-full before output flushes. 3. **Woodpecker 3.13.0 regression** — check upstream issues for "log truncation" / "grpc wait" in 3.13.x. 4. **Docker-in-docker socket/logging driver mismatch** — if the agent uses `json-file` driver while the server expects streaming, short-lived containers may flush to disk after the agent has already reported the step. ## Acceptance criteria - [ ] `disinto-woodpecker-agent` container status is `healthy` and stays healthy under load (spin up a known-failing pipeline 3× in a row — all three logs must contain the full script output, not just the shebang trace). - [ ] No `queue: task canceled` or `wait(): code: Unknown` entries in the agent log for a 1-hour idle period. - [ ] A deliberately-crashing step (e.g. `bash -c 'echo hello; echo err >&2; exit 1'`) produces a log containing both `hello` and `err` and exit code 1 in Woodpecker UI. ## Affected files / surface - `docker-compose.yml` — woodpecker / woodpecker-agent service env vars - agent log config / healthcheck definition - possibly: Woodpecker server/agent version pin ## Related - #1025 / PR #1033 — revealed the log truncation while trying to debug the edge-subpath smoke test - (agent-log errors predate #1033 — see 2026-04-18T12:33:24Z entries in agent log)
disinto-admin added the
bug-report
label 2026-04-19 15:37:00 +00:00
planner-bot added the
backlog
label 2026-04-19 17:02:26 +00:00
Collaborator

Planner run 5 (2026-04-19): Added backlog. CI reliability issue — woodpecker-agent unhealthy state causes log truncation on failing steps, making debugging impossible. Affects all pipeline issues.

**Planner run 5 (2026-04-19):** Added backlog. CI reliability issue — woodpecker-agent unhealthy state causes log truncation on failing steps, making debugging impossible. Affects all pipeline issues.
dev-qwen self-assigned this 2026-04-19 17:02:40 +00:00
dev-qwen added
in-progress
and removed
backlog
labels 2026-04-19 17:02:41 +00:00
gardener-bot added the
backlog
label 2026-04-19 17:15:58 +00:00
Collaborator

Blocked — issue #1044

Field Value
Exit reason no_push
Timestamp 2026-04-19T19:02:46Z
Diagnostic output
Claude did not push branch fix/issue-1044
### Blocked — issue #1044 | Field | Value | |---|---| | Exit reason | `no_push` | | Timestamp | `2026-04-19T19:02:46Z` | <details><summary>Diagnostic output</summary> ``` Claude did not push branch fix/issue-1044 ``` </details>
dev-qwen added
blocked
and removed
in-progress
labels 2026-04-19 19:02:46 +00:00
dev-qwen added
in-progress
and removed
backlog
blocked
labels 2026-04-19 20:05:42 +00:00
dev-qwen removed their assignment 2026-04-19 20:19:28 +00:00
dev-qwen removed the
in-progress
label 2026-04-19 20:19:28 +00:00
Sign in to join this conversation.
No milestone
No project
No assignees
3 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: disinto-admin/disinto#1044
No description provided.