CI: edge-subpath/caddy-validate step times out on docker.sock (context deadline exceeded) #1124

Open
opened 2026-04-21 13:41:40 +00:00 by dev-bot · 0 comments
Collaborator

Symptom

The caddy-validate step in the edge-subpath workflow fails intermittently with:

Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.41/containers/wp_01KPQZ2WV7SVX68TDRC7DP2Z9M/json": context deadline exceeded

Exit code on the step: 126. Downstream steps (caddyfile-routing-test, test-caddyfile-routing, etc.) get skipped, and the workflow reports failure.

This showed up on PR #1108 (gardener housekeeping, commit 0946ca9828, pipeline 1597, workflow id 3470, step pid 12). The sibling workflows for PR #1112 (pipeline 1599) and PR #1113 (pipeline 1601) also sat pending indefinitely. Those two are not confirmed to be the same failure mode, but the delay pattern matches: edge-subpath stays pending long after ci, nomad-validate, and secret-scan complete. Worth verifying on a fresh run.

The edge-subpath workflow is not in the required-status-contexts list (branch protection requires ci/woodpecker/pr/ci and ci/woodpecker/push/ci only), so this doesn't block merge by itself. But it does leave combined commit status at failure/pending and forces force_merge: true on any admin-merge path — plus the reviewer-agent almost certainly gates on combined status, so every legitimate review flow stalls here.
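
To make the gap between "required checks pass" and "combined status is green" concrete, here is a minimal sketch. It assumes the forge reports combined status as the worst state across all contexts, which matches the effect described above but has not been verified against the forge's implementation. The context name ci/woodpecker/pr/edge-subpath is a guess for illustration; only the two required context names come from this issue.

```python
# Contexts that branch protection requires (taken from this issue).
REQUIRED = {"ci/woodpecker/pr/ci", "ci/woodpecker/push/ci"}


def combined_state(statuses: dict[str, str]) -> str:
    """Worst-of-all-contexts rule (assumed): failure beats pending beats success."""
    states = set(statuses.values())
    if states & {"failure", "error"}:
        return "failure"
    if "pending" in states:
        return "pending"
    return "success"


def required_state(statuses: dict[str, str], required: set[str] = REQUIRED) -> str:
    """Same rule, restricted to required contexts; a missing context counts as pending."""
    return combined_state({ctx: statuses.get(ctx, "pending") for ctx in required})


# Hypothetical status set for a PR in the state described above.
example = {
    "ci/woodpecker/pr/ci": "success",
    "ci/woodpecker/push/ci": "success",
    "ci/woodpecker/pr/edge-subpath": "failure",  # assumed context name
}
```

With this input, `required_state(example)` is success while `combined_state(example)` is failure, which is the situation that pushes admin merges onto force_merge: true.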

Reproduction

Happens under load when multiple pipelines queue up. Triggered every time while the WP agent was wedged on 2026-04-20 — dozens of wait(): code: DeadlineExceeded lines in docker logs disinto-woodpecker-agent between 11:17Z and 13:10Z. After restarting the agent, pipeline 1597 completed except for this step.

Likely cause

The step mounts the host's /var/run/docker.sock inside the workflow container (typical pattern when a CI step needs to spin up sibling containers; caddy-validate presumably spawns a Caddy container to render/validate the Caddyfile). The Get /v1.41/containers/.../json call is a container-inspect request sent over that socket to the host's Docker daemon (socket passthrough rather than true Docker-in-Docker), and it times out.

Candidates:

  1. Socket passthrough is saturated. The host Docker daemon behind the mounted socket is handling too many concurrent API calls, backed up from the pipeline pile-up, and individual GET container requests exceed the default deadline.
  2. Woodpecker agent's default step timeout is too tight for the caddy-validate operation during busy periods.
  3. The step code uses a short context.WithTimeout that doesn't account for a busy Docker daemon.

Proposal

Pick one based on what the step actually does:

  • If the step's container-introspect is incidental (just checking if a helper container is up), switch to polling with retry + exponential backoff and a larger overall budget (60–120s) rather than a single short deadline.
  • If the step spawns a sibling container and talks to it, consider running the check inside the workflow container directly instead (no docker.sock mount): caddy validate is a single binary call, and a linter-style check needs no container orchestration.
  • Short-term: set the step-level when.failure: ignore (the exact key name should be checked against the Woodpecker version in use), or move edge-subpath into a separate optional pipeline, so it stops showing up as combined-status failure on otherwise-green PRs.
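
The first option (polling with retry, exponential backoff, and a larger overall budget) could look roughly like the sketch below. It is not the step's actual code, which has not been inspected here. The function name wait_until_ready, its parameters, and the 90-second default are illustrative assumptions; the probe itself (for example a container-inspect call) is passed in as a callable so that no particular Docker client API is assumed.

```python
import random
import time
from typing import Callable


def wait_until_ready(
    check: Callable[[], bool],
    budget_s: float = 90.0,        # overall budget, within the 60-120s range proposed above
    first_delay_s: float = 0.5,
    max_delay_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `check` with exponential backoff until it succeeds or the budget runs out."""
    deadline = clock() + budget_s
    delay = first_delay_s
    while True:
        try:
            if check():
                return True
        except (TimeoutError, ConnectionError):
            # A slow or refused probe means "not ready yet", not a fatal error.
            pass
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Random jitter keeps queued pipelines from probing the daemon in lockstep.
        sleep(min(remaining, random.uniform(0, delay)))
        delay = min(delay * 2, max_delay_s)
```

Each individual probe should still carry its own short timeout; the point of the sketch is that one slow response consumes part of the budget rather than failing the step outright.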

Acceptance

  • A PR that passes the required ci workflow also produces a green (or explicitly-optional) edge-subpath result, with no context deadline exceeded in the step logs over ten consecutive runs.
  • Reviewer-agent no longer gets blocked by this workflow on merge-eligible PRs.
  • If the fix is "mark as optional," the branch-protection required-contexts list is reviewed so it's clear which checks actually gate.
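
The first acceptance criterion can be checked mechanically once step results and logs are exported. The sketch below assumes they have already been collected into simple records; the StepRun type and its fields are hypothetical, and fetching the data from Woodpecker is deliberately left out. It covers only the "green" variant of the criterion, not the "explicitly-optional" one.

```python
from dataclasses import dataclass

ERROR_MARKER = "context deadline exceeded"


@dataclass
class StepRun:
    pipeline: int   # pipeline number, used only for ordering
    status: str     # e.g. "success" or "failure"
    log: str        # full text of the caddy-validate step log


def acceptance_met(runs: list[StepRun], required_streak: int = 10) -> bool:
    """True if the latest `required_streak` runs are all green and free of the timeout marker."""
    if len(runs) < required_streak:
        return False
    latest = sorted(runs, key=lambda r: r.pipeline)[-required_streak:]
    return all(r.status == "success" and ERROR_MARKER not in r.log for r in latest)
```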

Context

Observed 2026-04-21 during triage of why PRs were backing up in queue. WP agent restart drained the queue for most workflows; this one step remained stuck or timing out. The merged commit for #1108 shipped with this check in failure.

dev-bot added the bug-report and reproduced labels 2026-04-21 13:41:40 +00:00
Reference: disinto-admin/disinto#1124