Compare commits

...

2 commits

Author SHA1 Message Date
5324a6b119 Merge pull request 'fix: [nomad-step-1] S1.4 — extend Woodpecker CI to nomad job validate nomad/jobs/*.hcl (#843)' (#857) from fix/issue-843 into main
Some checks failed
ci/woodpecker/push/ci Pipeline failed
ci/woodpecker/push/nomad-validate Pipeline was successful
2026-04-16 10:38:34 +00:00
Claude
93018b3db6 fix: [nomad-step-1] S1.4 — extend Woodpecker CI to nomad job validate nomad/jobs/*.hcl (#843)
All checks were successful
ci/woodpecker/push/ci Pipeline was successful
ci/woodpecker/push/nomad-validate Pipeline was successful
ci/woodpecker/pr/ci Pipeline was successful
ci/woodpecker/pr/nomad-validate Pipeline was successful
ci/woodpecker/pr/secret-scan Pipeline was successful
Step 2 of .woodpecker/nomad-validate.yml previously ran
`nomad job validate` against a single explicit path
(nomad/jobs/forgejo.nomad.hcl, wired up during the S1.1 review). Replace
that with a POSIX-sh loop over nomad/jobs/*.nomad.hcl so every jobspec
gets CI coverage automatically — no "edit the pipeline" step to forget
when the next jobspec (woodpecker, caddy, agents, …) lands.

Why reverse S1.1's explicit-line approach: the "no-ad-hoc-steps"
principle that drove the explicit list was about keeping step *classes*
enumerated, not about re-listing every file of the same class. Globbing
over `*.nomad.hcl` still encodes a single class ("jobspec validation")
and is strictly stricter — a dropped jobspec can't silently bypass CI
because someone forgot to add its line. The `.nomad.hcl` suffix (set as
convention by S1.1 review) is what keeps non-jobspec HCL out of this
loop.

Implementation notes:
- `[ -f "$f" ] || continue` guards the no-match case. POSIX sh has no
  nullglob, so an empty jobs/ dir would otherwise leave the literal
  glob in $f and fail nomad job validate with "no such file". Not
  reachable today (forgejo.nomad.hcl exists), but keeps the step safe
  against any transient empty state during future refactors.
- `set -e` inside the block ensures the first failing jobspec aborts
  (default Woodpecker behavior, but explicit is cheap).
- Loop echoes the file being validated so CI logs point at the
  specific jobspec on failure.

Docs (nomad/AGENTS.md):
- "How CI validates these files" now lists all *five* steps (the S1.1
  review added step 2 but didn't update the doc; fixed in passing).
- Step 2 is documented with explicit scope: what offline validate
  catches (unknown stanzas, missing required fields, wrong value
  types, bad driver config) and what it does NOT catch (cross-file
  host_volume name resolution against client.hcl — that's a
  scheduling-time check; image reachability).
- "Adding a jobspec" step 4 updated: no pipeline edit required as
  long as the file follows the `*.nomad.hcl` naming convention. The
  suffix is now documented as load-bearing in step 1.
- Step 2 of the "Adding a jobspec" checklist cross-links the
  host_volume scheduling-time check, so contributors know the
  paired-write rule (client.hcl + cluster-up.sh) is the real
  guardrail for that class of drift.

Acceptance criteria:
- Broken jobspec (typo in stanza, missing required field) fails step
  2 with nomad's error message — covered by the loop over every file.
- Fixed jobspec passes — standard validate behavior.
- Step 1 (nomad config validate) untouched.
- No .sh changes, so no shellcheck impact; manual shellcheck pass
  shown clean.
- Trigger path `nomad/**` already covers `nomad/jobs/**` (confirmed,
  no change needed to `when:` block).

Refs: #843 (S1.4), #825 (S0.5 base pipeline), #840 (S1.1 first jobspec)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 10:32:08 +00:00
2 changed files with 71 additions and 20 deletions

View file

@ -15,7 +15,9 @@
# #
# Steps (all fail-closed — any error blocks merge): # Steps (all fail-closed — any error blocks merge):
# 1. nomad-config-validate — `nomad config validate` on server + client HCL # 1. nomad-config-validate — `nomad config validate` on server + client HCL
# 2. nomad-job-validate — `nomad job validate` on every nomad/jobs/*.nomad.hcl # 2. nomad-job-validate — `nomad job validate` looped over every
# nomad/jobs/*.nomad.hcl (new jobspecs get
# CI coverage automatically)
# 3. vault-operator-diagnose — `vault operator diagnose` syntax check on vault.hcl # 3. vault-operator-diagnose — `vault operator diagnose` syntax check on vault.hcl
# 4. shellcheck-nomad — shellcheck the cluster-up + install scripts + disinto # 4. shellcheck-nomad — shellcheck the cluster-up + install scripts + disinto
# 5. bats-init-nomad — `disinto init --backend=nomad --dry-run` smoke tests # 5. bats-init-nomad — `disinto init --backend=nomad --dry-run` smoke tests
@ -65,14 +67,35 @@ steps:
# with "unknown block 'job'", and vice versa. Hence two separate steps. # with "unknown block 'job'", and vice versa. Hence two separate steps.
# #
# Validation is offline: no running Nomad server is required (exit 0 on # Validation is offline: no running Nomad server is required (exit 0 on
# valid HCL, 1 on syntax/semantic error). One invocation per file — the # valid HCL, 1 on syntax/semantic error). The CLI takes a single path
# CLI takes a single path argument. New jobspecs get explicit lines here # argument so we loop over every `*.nomad.hcl` file under nomad/jobs/ —
# so bringing one up is a conscious CI edit, matching step 1's pattern # that way a new jobspec PR gets CI coverage automatically (no separate
# and this file's "no-ad-hoc-steps" principle. # "edit the pipeline" step to forget). The `.nomad.hcl` suffix is the
# naming convention documented in nomad/AGENTS.md; anything else in
# nomad/jobs/ is deliberately not validated by this step.
#
# `[ -f "$f" ]` guards against the no-match case: POSIX sh does not
# nullglob, so an empty jobs/ directory would leave the literal glob in
# "$f" and fail. Today forgejo.nomad.hcl exists, but the guard keeps the
# step safe during any future transient empty state.
#
# Scope note: offline validate catches jobspec-level errors (unknown
# stanzas, missing required fields, wrong value types, invalid driver
# config). It does NOT resolve cross-file references like host_volume
# source names against nomad/client.hcl — that mismatch surfaces at
# scheduling time on the live cluster, not here. The paired-write rule
# in nomad/AGENTS.md ("add to both client.hcl and cluster-up.sh") is the
# primary guardrail for that class of drift.
- name: nomad-job-validate - name: nomad-job-validate
image: hashicorp/nomad:1.9.5 image: hashicorp/nomad:1.9.5
commands: commands:
- nomad job validate nomad/jobs/forgejo.nomad.hcl - |
set -e
for f in nomad/jobs/*.nomad.hcl; do
[ -f "$f" ] || continue
echo "validating jobspec: $f"
nomad job validate "$f"
done
# ── 3. Vault HCL syntax check ──────────────────────────────────────────── # ── 3. Vault HCL syntax check ────────────────────────────────────────────
# `vault operator diagnose` loads the config and runs a suite of checks. # `vault operator diagnose` loads the config and runs a suite of checks.

View file

@ -35,41 +35,69 @@ it owns.
## Adding a jobspec (Step 1 and later) ## Adding a jobspec (Step 1 and later)
1. Drop a file in `nomad/jobs/<service>.nomad.hcl`. 1. Drop a file in `nomad/jobs/<service>.nomad.hcl`. The `.nomad.hcl`
suffix is load-bearing: `.woodpecker/nomad-validate.yml` globs on
exactly that suffix to auto-pick up new jobspecs (see step 2 in
"How CI validates these files" below). Anything else in
`nomad/jobs/` is silently skipped by CI.
2. If it needs persistent state, reference a `host_volume` already 2. If it needs persistent state, reference a `host_volume` already
declared in `client.hcl`*don't* add ad-hoc host paths in the declared in `client.hcl`*don't* add ad-hoc host paths in the
jobspec. If a new volume is needed, add it to **both**: jobspec. If a new volume is needed, add it to **both**:
- `nomad/client.hcl` — the `host_volume "<name>" { path = … }` block - `nomad/client.hcl` — the `host_volume "<name>" { path = … }` block
- `lib/init/nomad/cluster-up.sh` — the `HOST_VOLUME_DIRS` array - `lib/init/nomad/cluster-up.sh` — the `HOST_VOLUME_DIRS` array
The two must stay in sync or nomad fingerprinting will fail and the The two must stay in sync or nomad fingerprinting will fail and the
node stays in "initializing". node stays in "initializing". Note that offline `nomad job validate`
will NOT catch a typo in the jobspec's `source = "..."` against the
client.hcl host_volume list (see step 2 below) — the scheduler
rejects the mismatch at placement time instead.
3. Pin image tags — `image = "forgejo/forgejo:1.22.5"`, not `:latest`. 3. Pin image tags — `image = "forgejo/forgejo:1.22.5"`, not `:latest`.
4. Add the jobspec path to `.woodpecker/nomad-validate.yml`'s trigger 4. No pipeline edit required — step 2 of `nomad-validate.yml` globs
list so CI validates it. over `nomad/jobs/*.nomad.hcl` and validates every match. Just make
sure the existing `nomad/**` trigger path still covers your file
(it does for anything under `nomad/jobs/`).
## How CI validates these files ## How CI validates these files
`.woodpecker/nomad-validate.yml` runs on every PR that touches `nomad/`, `.woodpecker/nomad-validate.yml` runs on every PR that touches `nomad/`
`lib/init/nomad/`, or `bin/disinto`. Four fail-closed steps: (including `nomad/jobs/`), `lib/init/nomad/`, or `bin/disinto`. Five
fail-closed steps:
1. **`nomad config validate nomad/server.hcl nomad/client.hcl`** 1. **`nomad config validate nomad/server.hcl nomad/client.hcl`**
— parses the HCL, fails on unknown blocks, bad port ranges, invalid — parses the HCL, fails on unknown blocks, bad port ranges, invalid
driver config. Vault HCL is excluded (different tool). driver config. Vault HCL is excluded (different tool). Jobspecs are
2. **`vault operator diagnose -config=nomad/vault.hcl -skip=storage -skip=listener`** excluded too — agent-config and jobspec are disjoint HCL grammars;
running this step on a jobspec rejects it with "unknown block 'job'".
2. **`nomad job validate nomad/jobs/*.nomad.hcl`** (loop, one call per file)
— parses each jobspec's HCL, fails on unknown stanzas, missing
required fields, wrong value types, invalid driver config. Runs
offline (no Nomad server needed) so CI exit 0 ≠ "this will schedule
successfully"; it means "the HCL itself is well-formed". What this
step does NOT catch:
- cross-file references (`source = "forgejo-data"` typo against the
`host_volume` list in `client.hcl`) — that's a scheduling-time
check on the live cluster, not validate-time.
- image reachability — `image = "codeberg.org/forgejo/forgejo:11.0"`
is accepted even if the registry is down or the tag is wrong.
New jobspecs are picked up automatically by the glob — no pipeline
edit needed as long as the file is named `<name>.nomad.hcl`.
3. **`vault operator diagnose -config=nomad/vault.hcl -skip=storage -skip=listener`**
— Vault's equivalent syntax + schema check. `-skip=storage/listener` — Vault's equivalent syntax + schema check. `-skip=storage/listener`
disables the runtime checks (CI containers don't have disables the runtime checks (CI containers don't have
`/var/lib/vault/data` or port 8200). `/var/lib/vault/data` or port 8200). Exit 2 (advisory warnings only,
3. **`shellcheck --severity=warning lib/init/nomad/*.sh bin/disinto`** e.g. TLS-disabled listener) is tolerated; exit 1 blocks merge.
4. **`shellcheck --severity=warning lib/init/nomad/*.sh bin/disinto`**
— all init/dispatcher shell clean. `bin/disinto` has no `.sh` — all init/dispatcher shell clean. `bin/disinto` has no `.sh`
extension so the repo-wide shellcheck in `.woodpecker/ci.yml` skips extension so the repo-wide shellcheck in `.woodpecker/ci.yml` skips
it — this is the one place it gets checked. it — this is the one place it gets checked.
4. **`bats tests/disinto-init-nomad.bats`** 5. **`bats tests/disinto-init-nomad.bats`**
— exercises the dispatcher: `disinto init --backend=nomad --dry-run`, — exercises the dispatcher: `disinto init --backend=nomad --dry-run`,
`… --empty --dry-run`, and the `--backend=docker` regression guard. `… --empty --dry-run`, and the `--backend=docker` regression guard.
If a PR breaks `nomad/server.hcl` (e.g. typo in a block name), step 1 If a PR breaks `nomad/server.hcl` (e.g. typo in a block name), step 1
fails with a clear error; the fix makes it pass. PRs that don't touch fails with a clear error; if it breaks a jobspec (e.g. misspells
any of the trigger paths skip this pipeline entirely. `task` as `tsak`, or adds a `volume` stanza without a `source`), step
2 fails instead. The fix makes it pass. PRs that don't touch any of
the trigger paths skip this pipeline entirely.
## Version pinning ## Version pinning