fix: cap caddy-validate step timeout so it fails fast instead of hanging the whole queue (implements #1124) #1127

Closed
opened 2026-04-21 14:29:41 +00:00 by dev-bot · 1 comment
Collaborator

Implements fix for #1124 — the caddy-validate step in .woodpecker/edge-subpath.yml currently does curl -sS -o /tmp/caddy https://caddyserver.com/api/download?os=linux&arch=amd64 with no timeout. When that curl hangs (observed repeatedly 2026-04-21), the step runs until the Woodpecker agent's own ~1h deadline kicks in. With agent capacity=1 parallel pipeline, this blocks every other workflow in the queue.

Minimum-scope change (hotpatch)

In .woodpecker/edge-subpath.yml, change the caddy-validate step's curl to:

curl -fsSL --max-time 60 --connect-timeout 10 -o /tmp/caddy "https://caddyserver.com/api/download?os=linux&arch=amd64"
  • --max-time 60 caps total wall time at 60s
  • --connect-timeout 10 fails fast on network unreachable
  • -f makes HTTP errors non-zero (currently -sS would swallow 5xx)
  • -L follows redirects (future-proofing)

If the download fails, the step exits non-zero within ~60s and the workflow is marked failed. The queue moves on instead of waiting an hour.

Better follow-on (separate issue, low-priority)

Cache the caddy binary in a base image instead of downloading on every run. The current "download caddy on every CI run" pattern is fragile (external dep) and wasteful (~40MB per run × every PR). Build a small disinto/caddy-validate:latest image in docker/ with caddy pre-installed, push to the internal registry, reference it from the step. Track separately — not required for this unblock.

Acceptance

  • caddy-validate step completes or fails within ~90s under load, never runs past that wall.
  • The queue stops piling up behind this single step.
  • If the download itself is unreachable, the error message in the step log makes it obvious (-f + --max-time).

Context

Filed as follow-up to #1124 per the bug→backlog handoff pattern. #1124 has the full reproduction and diagnosis.

Observed 2026-04-21 at peak: 32 workflows pending, 1 running, worker_count=0 — single caddy-validate container held the entire agent for 55+ minutes. Unblock for that session was (a) cancelling stale pipelines for already-merged PRs, (b) letting the agent's 1h deadline expire on the stuck step. This issue makes that self-healing.

Implements fix for #1124 — the caddy-validate step in `.woodpecker/edge-subpath.yml` currently does `curl -sS -o /tmp/caddy https://caddyserver.com/api/download?os=linux&arch=amd64` with no timeout. When that curl hangs (observed repeatedly 2026-04-21), the step runs until the Woodpecker agent's own ~1h deadline kicks in. With agent capacity=1 parallel pipeline, this blocks every other workflow in the queue. ## Minimum-scope change (hotpatch) In `.woodpecker/edge-subpath.yml`, change the `caddy-validate` step's curl to: ```bash curl -fsSL --max-time 60 --connect-timeout 10 -o /tmp/caddy "https://caddyserver.com/api/download?os=linux&arch=amd64" ``` - `--max-time 60` caps total wall time at 60s - `--connect-timeout 10` fails fast on network unreachable - `-f` makes HTTP errors non-zero (currently `-sS` would swallow 5xx) - `-L` follows redirects (future-proofing) If the download fails, the step exits non-zero within ~60s and the workflow is marked failed. The queue moves on instead of waiting an hour. ## Better follow-on (separate issue, low-priority) Cache the caddy binary in a base image instead of downloading on every run. The current "download caddy on every CI run" pattern is fragile (external dep) and wasteful (~40MB per run × every PR). Build a small `disinto/caddy-validate:latest` image in `docker/` with caddy pre-installed, push to the internal registry, reference it from the step. Track separately — not required for this unblock. ## Acceptance - `caddy-validate` step completes or fails within ~90s under load, never runs past that wall. - The queue stops piling up behind this single step. - If the download itself is unreachable, the error message in the step log makes it obvious (`-f` + `--max-time`). ## Context Filed as follow-up to #1124 per the bug→backlog handoff pattern. #1124 has the full reproduction and diagnosis. Observed 2026-04-21 at peak: 32 workflows pending, 1 running, worker_count=0 — single caddy-validate container held the entire agent for 55+ minutes. Unblock for that session was (a) cancelling stale pipelines for already-merged PRs, (b) letting the agent's 1h deadline expire on the stuck step. This issue makes that self-healing.
dev-bot added the
backlog
label 2026-04-21 14:29:41 +00:00
dev-qwen self-assigned this 2026-04-21 14:37:00 +00:00
dev-qwen added
in-progress
and removed
backlog
labels 2026-04-21 14:37:00 +00:00
Collaborator

Blocked — issue #1127

Field Value
Exit reason ci_timeout
Timestamp 2026-04-21T15:07:45Z
### Blocked — issue #1127 | Field | Value | |---|---| | Exit reason | `ci_timeout` | | Timestamp | `2026-04-21T15:07:45Z` |
dev-qwen added
blocked
and removed
in-progress
labels 2026-04-21 15:07:46 +00:00
disinto-admin added
backlog
and removed
blocked
labels 2026-04-21 15:11:25 +00:00
dev-qwen was unassigned by disinto-admin 2026-04-21 15:11:25 +00:00
Sign in to join this conversation.
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: disinto-admin/disinto#1127
No description provided.