fix: cap caddy-validate step timeout so it fails fast instead of hanging the whole queue (implements #1124) #1127

New issue

Closed

opened 2026-04-21 14:29:41 +00:00 by dev-bot · 1 comment

dev-bot commented

2026-04-21 14:29:41 +00:00

Collaborator

Implements fix for #1124 — the caddy-validate step in .woodpecker/edge-subpath.yml currently does curl -sS -o /tmp/caddy https://caddyserver.com/api/download?os=linux&arch=amd64 with no timeout. When that curl hangs (observed repeatedly 2026-04-21), the step runs until the Woodpecker agent's own ~1h deadline kicks in. With agent capacity=1 parallel pipeline, this blocks every other workflow in the queue.

Minimum-scope change (hotpatch)

In .woodpecker/edge-subpath.yml, change the caddy-validate step's curl to:

curl -fsSL --max-time 60 --connect-timeout 10 -o /tmp/caddy "https://caddyserver.com/api/download?os=linux&arch=amd64"

--max-time 60 caps total wall time at 60s
--connect-timeout 10 fails fast on network unreachable
-f makes HTTP errors non-zero (currently -sS would swallow 5xx)
-L follows redirects (future-proofing)

If the download fails, the step exits non-zero within ~60s and the workflow is marked failed. The queue moves on instead of waiting an hour.

Better follow-on (separate issue, low-priority)

Cache the caddy binary in a base image instead of downloading on every run. The current "download caddy on every CI run" pattern is fragile (external dep) and wasteful (~40MB per run × every PR). Build a small disinto/caddy-validate:latest image in docker/ with caddy pre-installed, push to the internal registry, reference it from the step. Track separately — not required for this unblock.

Acceptance

caddy-validate step completes or fails within ~90s under load, never runs past that wall.
The queue stops piling up behind this single step.
If the download itself is unreachable, the error message in the step log makes it obvious (-f + --max-time).

Context

Filed as follow-up to #1124 per the bug→backlog handoff pattern. #1124 has the full reproduction and diagnosis.

Observed 2026-04-21 at peak: 32 workflows pending, 1 running, worker_count=0 — single caddy-validate container held the entire agent for 55+ minutes. Unblock for that session was (a) cancelling stale pipelines for already-merged PRs, (b) letting the agent's 1h deadline expire on the stuck step. This issue makes that self-healing.

Implements fix for #1124 — the caddy-validate step in `.woodpecker/edge-subpath.yml` currently does `curl -sS -o /tmp/caddy https://caddyserver.com/api/download?os=linux&arch=amd64` with no timeout. When that curl hangs (observed repeatedly 2026-04-21), the step runs until the Woodpecker agent's own ~1h deadline kicks in. With agent capacity=1 parallel pipeline, this blocks every other workflow in the queue. ## Minimum-scope change (hotpatch) In `.woodpecker/edge-subpath.yml`, change the `caddy-validate` step's curl to: ```bash curl -fsSL --max-time 60 --connect-timeout 10 -o /tmp/caddy "https://caddyserver.com/api/download?os=linux&arch=amd64" ``` - `--max-time 60` caps total wall time at 60s - `--connect-timeout 10` fails fast on network unreachable - `-f` makes HTTP errors non-zero (currently `-sS` would swallow 5xx) - `-L` follows redirects (future-proofing) If the download fails, the step exits non-zero within ~60s and the workflow is marked failed. The queue moves on instead of waiting an hour. ## Better follow-on (separate issue, low-priority) Cache the caddy binary in a base image instead of downloading on every run. The current "download caddy on every CI run" pattern is fragile (external dep) and wasteful (~40MB per run × every PR). Build a small `disinto/caddy-validate:latest` image in `docker/` with caddy pre-installed, push to the internal registry, reference it from the step. Track separately — not required for this unblock. ## Acceptance - `caddy-validate` step completes or fails within ~90s under load, never runs past that wall. - The queue stops piling up behind this single step. - If the download itself is unreachable, the error message in the step log makes it obvious (`-f` + `--max-time`). ## Context Filed as follow-up to #1124 per the bug→backlog handoff pattern. #1124 has the full reproduction and diagnosis. Observed 2026-04-21 at peak: 32 workflows pending, 1 running, worker_count=0 — single caddy-validate container held the entire agent for 55+ minutes. Unblock for that session was (a) cancelling stale pipelines for already-merged PRs, (b) letting the agent's 1h deadline expire on the stuck step. This issue makes that self-healing.