feat: disinto validate rejects CI steps with no timeout declared #1137

Closed
opened 2026-04-21 16:38:02 +00:00 by dev-bot · 0 comments
Collaborator

Part of the architectural response to the 2026-04-21 CI chaos-monkey cascade (Codeberg #843).

Problem

The proximate cause of the 4-hour factory freeze was a single curl call in .woodpecker/edge-subpath.yml with no --max-time. The Woodpecker agent's own ~1h step deadline was the only backstop, and with agent capacity=1 that was enough to starve the whole queue. Fix #1131 added timeout flags to that one curl. But the class of bug — a CI step with no wall-clock ceiling — is still reachable from every new .woodpecker/*.yml file agents and operators author.

Per-step discipline is not a safety mechanism. The framework has to enforce it.

Desired behaviour

bin/disinto validate (and any CI-config generation path the factory uses) fails when a step in .woodpecker/**.yml has neither of:

  • A step-level timeout: declaration (Woodpecker supports this per-step).
  • A workflow-level default that propagates.

Missing timeout → non-zero exit with actionable message:

error: .woodpecker/edge-subpath.yml:caddy-validate — step has no timeout; add `timeout: 5m` or inherit from workflow default

Further: any curl, wget, npm, yarn, pip, cargo, gem install, go get, brew install invocation inside a step's commands that lacks an explicit per-command timeout flag (e.g. curl --max-time, wget --timeout) should also trigger a validator warning. External-network calls are the dominant flake surface; they need belt-and-suspenders on top of the step-level timeout.

Fix sketch

  1. Extend bin/disinto validate (or add a new lint-ci subcommand invoked from validate) with a pass over every .woodpecker/*.yml.
  2. For each step: check timeout key exists; if not, check workflow-level default; if not, emit error with file:step coordinates.
  3. For each step's commands: list: pattern-match known network-fetch binaries; if found without a timeout flag, emit warning with --max-time / --timeout suggestion.
  4. Hook it into the factory bootstrap so projects don't onboard without it.
  5. Fix all existing offenders in the disinto repo as part of the same PR — otherwise validate will fail on the current tree.

Acceptance

  • disinto validate fails on a repo that has even one step without a timeout.
  • All current .woodpecker/*.yml files pass (back-populate timeouts before landing the check).
  • The external-network-command warning fires on a test fixture containing curl https://example.com and clears when --max-time N is added.
  • Documented in .disinto/docs/ and in the validator's --help output.
  • shellcheck / yamllint clean.

Out of scope

  • Actually running a CI step to check it completes in the declared timeout. Pure static analysis of the YAML + commands.
  • Rewriting existing Woodpecker files to use a shared default block — that's a separate refactor. This issue is just the guardrail.
  • Codeberg #843 — incident writeup
  • #1131 — hotpatch that added timeout to the one offending step; this generalizes the lesson
  • #1124 — the original diagnosis of why no timeout was catastrophic
Part of the architectural response to the 2026-04-21 CI chaos-monkey cascade (Codeberg #843). ## Problem The proximate cause of the 4-hour factory freeze was a single `curl` call in `.woodpecker/edge-subpath.yml` with no `--max-time`. The Woodpecker agent's own ~1h step deadline was the only backstop, and with agent capacity=1 that was enough to starve the whole queue. Fix #1131 added timeout flags to that one curl. But the class of bug — *a CI step with no wall-clock ceiling* — is still reachable from every new `.woodpecker/*.yml` file agents and operators author. Per-step discipline is not a safety mechanism. The framework has to enforce it. ## Desired behaviour `bin/disinto validate` (and any CI-config generation path the factory uses) fails when a step in `.woodpecker/**.yml` has neither of: - A step-level `timeout:` declaration (Woodpecker supports this per-step). - A workflow-level default that propagates. Missing timeout → non-zero exit with actionable message: ``` error: .woodpecker/edge-subpath.yml:caddy-validate — step has no timeout; add `timeout: 5m` or inherit from workflow default ``` Further: any `curl`, `wget`, `npm`, `yarn`, `pip`, `cargo`, `gem install`, `go get`, `brew install` invocation *inside* a step's commands that lacks an explicit per-command timeout flag (e.g. `curl --max-time`, `wget --timeout`) should also trigger a validator warning. External-network calls are the dominant flake surface; they need belt-and-suspenders on top of the step-level timeout. ## Fix sketch 1. Extend `bin/disinto validate` (or add a new `lint-ci` subcommand invoked from validate) with a pass over every `.woodpecker/*.yml`. 2. For each step: check `timeout` key exists; if not, check workflow-level default; if not, emit error with file:step coordinates. 3. For each step's `commands:` list: pattern-match known network-fetch binaries; if found without a timeout flag, emit warning with `--max-time` / `--timeout` suggestion. 4. Hook it into the factory bootstrap so projects don't onboard without it. 5. Fix all existing offenders in the disinto repo as part of the same PR — otherwise validate will fail on the current tree. ## Acceptance - `disinto validate` fails on a repo that has even one step without a timeout. - All current `.woodpecker/*.yml` files pass (back-populate timeouts before landing the check). - The external-network-command warning fires on a test fixture containing `curl https://example.com` and clears when `--max-time N` is added. - Documented in `.disinto/docs/` and in the validator's `--help` output. - `shellcheck` / `yamllint` clean. ## Out of scope - Actually *running* a CI step to check it completes in the declared timeout. Pure static analysis of the YAML + commands. - Rewriting existing Woodpecker files to use a shared default block — that's a separate refactor. This issue is just the guardrail. ## Related - Codeberg #843 — incident writeup - #1131 — hotpatch that added timeout to the one offending step; this generalizes the lesson - #1124 — the original diagnosis of why no timeout was catastrophic
dev-bot added the
backlog
label 2026-04-21 16:38:02 +00:00
dev-qwen self-assigned this 2026-04-21 17:42:24 +00:00
dev-qwen added
in-progress
and removed
backlog
labels 2026-04-21 17:42:24 +00:00
dev-qwen removed their assignment 2026-04-21 18:29:58 +00:00
dev-qwen removed the
in-progress
label 2026-04-21 18:29:58 +00:00
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: disinto-admin/disinto#1137
No description provided.