From 93018b3db64d14e059fffcae67dc936ab971fdce Mon Sep 17 00:00:00 2001 From: Claude Date: Thu, 16 Apr 2026 10:32:08 +0000 Subject: [PATCH] =?UTF-8?q?fix:=20[nomad-step-1]=20S1.4=20=E2=80=94=20exte?= =?UTF-8?q?nd=20Woodpecker=20CI=20to=20nomad=20job=20validate=20nomad/jobs?= =?UTF-8?q?/*.hcl=20(#843)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Step 2 of .woodpecker/nomad-validate.yml previously ran `nomad job validate` against a single explicit path (nomad/jobs/forgejo.nomad.hcl, wired up during the S1.1 review). Replace that with a POSIX-sh loop over nomad/jobs/*.nomad.hcl so every jobspec gets CI coverage automatically — no "edit the pipeline" step to forget when the next jobspec (woodpecker, caddy, agents, …) lands. Why reverse S1.1's explicit-line approach: the "no-ad-hoc-steps" principle that drove the explicit list was about keeping step *classes* enumerated, not about re-listing every file of the same class. Globbing over `*.nomad.hcl` still encodes a single class ("jobspec validation") and is strictly stricter — a dropped jobspec can't silently bypass CI because someone forgot to add its line. The `.nomad.hcl` suffix (set as convention by S1.1 review) is what keeps non-jobspec HCL out of this loop. Implementation notes: - `[ -f "$f" ] || continue` guards the no-match case. POSIX sh has no nullglob, so an empty jobs/ dir would otherwise leave the literal glob in $f and fail nomad job validate with "no such file". Not reachable today (forgejo.nomad.hcl exists), but keeps the step safe against any transient empty state during future refactors. - `set -e` inside the block ensures the first failing jobspec aborts (default Woodpecker behavior, but explicit is cheap). - Loop echoes the file being validated so CI logs point at the specific jobspec on failure. Docs (nomad/AGENTS.md): - "How CI validates these files" now lists all *five* steps (the S1.1 review added step 2 but didn't update the doc; fixed in passing). - Step 2 is documented with explicit scope: what offline validate catches (unknown stanzas, missing required fields, wrong value types, bad driver config) and what it does NOT catch (cross-file host_volume name resolution against client.hcl — that's a scheduling-time check; image reachability). - "Adding a jobspec" step 4 updated: no pipeline edit required as long as the file follows the `*.nomad.hcl` naming convention. The suffix is now documented as load-bearing in step 1. - Step 2 of the "Adding a jobspec" checklist cross-links the host_volume scheduling-time check, so contributors know the paired-write rule (client.hcl + cluster-up.sh) is the real guardrail for that class of drift. Acceptance criteria: - Broken jobspec (typo in stanza, missing required field) fails step 2 with nomad's error message — covered by the loop over every file. - Fixed jobspec passes — standard validate behavior. - Step 1 (nomad config validate) untouched. - No .sh changes, so no shellcheck impact; manual shellcheck pass shown clean. - Trigger path `nomad/**` already covers `nomad/jobs/**` (confirmed, no change needed to `when:` block). Refs: #843 (S1.4), #825 (S0.5 base pipeline), #840 (S1.1 first jobspec) Co-Authored-By: Claude Opus 4.6 (1M context) --- .woodpecker/nomad-validate.yml | 37 ++++++++++++++++++----- nomad/AGENTS.md | 54 ++++++++++++++++++++++++++-------- 2 files changed, 71 insertions(+), 20 deletions(-) diff --git a/.woodpecker/nomad-validate.yml b/.woodpecker/nomad-validate.yml index 83946c3..d5828e9 100644 --- a/.woodpecker/nomad-validate.yml +++ b/.woodpecker/nomad-validate.yml @@ -14,8 +14,10 @@ # .woodpecker/nomad-validate.yml — the pipeline definition # # Steps (all fail-closed — any error blocks merge): -# 1. nomad-config-validate — `nomad config validate` on server + client HCL -# 2. nomad-job-validate — `nomad job validate` on every nomad/jobs/*.nomad.hcl +# 1. nomad-config-validate — `nomad config validate` on server + client HCL +# 2. nomad-job-validate — `nomad job validate` looped over every +# nomad/jobs/*.nomad.hcl (new jobspecs get +# CI coverage automatically) # 3. vault-operator-diagnose — `vault operator diagnose` syntax check on vault.hcl # 4. shellcheck-nomad — shellcheck the cluster-up + install scripts + disinto # 5. bats-init-nomad — `disinto init --backend=nomad --dry-run` smoke tests @@ -65,14 +67,35 @@ steps: # with "unknown block 'job'", and vice versa. Hence two separate steps. # # Validation is offline: no running Nomad server is required (exit 0 on - # valid HCL, 1 on syntax/semantic error). One invocation per file — the - # CLI takes a single path argument. New jobspecs get explicit lines here - # so bringing one up is a conscious CI edit, matching step 1's pattern - # and this file's "no-ad-hoc-steps" principle. + # valid HCL, 1 on syntax/semantic error). The CLI takes a single path + # argument so we loop over every `*.nomad.hcl` file under nomad/jobs/ — + # that way a new jobspec PR gets CI coverage automatically (no separate + # "edit the pipeline" step to forget). The `.nomad.hcl` suffix is the + # naming convention documented in nomad/AGENTS.md; anything else in + # nomad/jobs/ is deliberately not validated by this step. + # + # `[ -f "$f" ]` guards against the no-match case: POSIX sh does not + # nullglob, so an empty jobs/ directory would leave the literal glob in + # "$f" and fail. Today forgejo.nomad.hcl exists, but the guard keeps the + # step safe during any future transient empty state. + # + # Scope note: offline validate catches jobspec-level errors (unknown + # stanzas, missing required fields, wrong value types, invalid driver + # config). It does NOT resolve cross-file references like host_volume + # source names against nomad/client.hcl — that mismatch surfaces at + # scheduling time on the live cluster, not here. The paired-write rule + # in nomad/AGENTS.md ("add to both client.hcl and cluster-up.sh") is the + # primary guardrail for that class of drift. - name: nomad-job-validate image: hashicorp/nomad:1.9.5 commands: - - nomad job validate nomad/jobs/forgejo.nomad.hcl + - | + set -e + for f in nomad/jobs/*.nomad.hcl; do + [ -f "$f" ] || continue + echo "validating jobspec: $f" + nomad job validate "$f" + done # ── 3. Vault HCL syntax check ──────────────────────────────────────────── # `vault operator diagnose` loads the config and runs a suite of checks. diff --git a/nomad/AGENTS.md b/nomad/AGENTS.md index ef7a43b..d80780f 100644 --- a/nomad/AGENTS.md +++ b/nomad/AGENTS.md @@ -35,41 +35,69 @@ it owns. ## Adding a jobspec (Step 1 and later) -1. Drop a file in `nomad/jobs/.nomad.hcl`. +1. Drop a file in `nomad/jobs/.nomad.hcl`. The `.nomad.hcl` + suffix is load-bearing: `.woodpecker/nomad-validate.yml` globs on + exactly that suffix to auto-pick up new jobspecs (see step 2 in + "How CI validates these files" below). Anything else in + `nomad/jobs/` is silently skipped by CI. 2. If it needs persistent state, reference a `host_volume` already declared in `client.hcl` — *don't* add ad-hoc host paths in the jobspec. If a new volume is needed, add it to **both**: - `nomad/client.hcl` — the `host_volume "" { path = … }` block - `lib/init/nomad/cluster-up.sh` — the `HOST_VOLUME_DIRS` array The two must stay in sync or nomad fingerprinting will fail and the - node stays in "initializing". + node stays in "initializing". Note that offline `nomad job validate` + will NOT catch a typo in the jobspec's `source = "..."` against the + client.hcl host_volume list (see step 2 below) — the scheduler + rejects the mismatch at placement time instead. 3. Pin image tags — `image = "forgejo/forgejo:1.22.5"`, not `:latest`. -4. Add the jobspec path to `.woodpecker/nomad-validate.yml`'s trigger - list so CI validates it. +4. No pipeline edit required — step 2 of `nomad-validate.yml` globs + over `nomad/jobs/*.nomad.hcl` and validates every match. Just make + sure the existing `nomad/**` trigger path still covers your file + (it does for anything under `nomad/jobs/`). ## How CI validates these files -`.woodpecker/nomad-validate.yml` runs on every PR that touches `nomad/`, -`lib/init/nomad/`, or `bin/disinto`. Four fail-closed steps: +`.woodpecker/nomad-validate.yml` runs on every PR that touches `nomad/` +(including `nomad/jobs/`), `lib/init/nomad/`, or `bin/disinto`. Five +fail-closed steps: 1. **`nomad config validate nomad/server.hcl nomad/client.hcl`** — parses the HCL, fails on unknown blocks, bad port ranges, invalid - driver config. Vault HCL is excluded (different tool). -2. **`vault operator diagnose -config=nomad/vault.hcl -skip=storage -skip=listener`** + driver config. Vault HCL is excluded (different tool). Jobspecs are + excluded too — agent-config and jobspec are disjoint HCL grammars; + running this step on a jobspec rejects it with "unknown block 'job'". +2. **`nomad job validate nomad/jobs/*.nomad.hcl`** (loop, one call per file) + — parses each jobspec's HCL, fails on unknown stanzas, missing + required fields, wrong value types, invalid driver config. Runs + offline (no Nomad server needed) so CI exit 0 ≠ "this will schedule + successfully"; it means "the HCL itself is well-formed". What this + step does NOT catch: + - cross-file references (`source = "forgejo-data"` typo against the + `host_volume` list in `client.hcl`) — that's a scheduling-time + check on the live cluster, not validate-time. + - image reachability — `image = "codeberg.org/forgejo/forgejo:11.0"` + is accepted even if the registry is down or the tag is wrong. + New jobspecs are picked up automatically by the glob — no pipeline + edit needed as long as the file is named `.nomad.hcl`. +3. **`vault operator diagnose -config=nomad/vault.hcl -skip=storage -skip=listener`** — Vault's equivalent syntax + schema check. `-skip=storage/listener` disables the runtime checks (CI containers don't have - `/var/lib/vault/data` or port 8200). -3. **`shellcheck --severity=warning lib/init/nomad/*.sh bin/disinto`** + `/var/lib/vault/data` or port 8200). Exit 2 (advisory warnings only, + e.g. TLS-disabled listener) is tolerated; exit 1 blocks merge. +4. **`shellcheck --severity=warning lib/init/nomad/*.sh bin/disinto`** — all init/dispatcher shell clean. `bin/disinto` has no `.sh` extension so the repo-wide shellcheck in `.woodpecker/ci.yml` skips it — this is the one place it gets checked. -4. **`bats tests/disinto-init-nomad.bats`** +5. **`bats tests/disinto-init-nomad.bats`** — exercises the dispatcher: `disinto init --backend=nomad --dry-run`, `… --empty --dry-run`, and the `--backend=docker` regression guard. If a PR breaks `nomad/server.hcl` (e.g. typo in a block name), step 1 -fails with a clear error; the fix makes it pass. PRs that don't touch -any of the trigger paths skip this pipeline entirely. +fails with a clear error; if it breaks a jobspec (e.g. misspells +`task` as `tsak`, or adds a `volume` stanza without a `source`), step +2 fails instead. The fix makes it pass. PRs that don't touch any of +the trigger paths skip this pipeline entirely. ## Version pinning -- 2.49.1