<!-- last-reviewed: 6bdbeb5bd2a200ff1b23724564da9383193f3e30 -->
# nomad/ — Agent Instructions
Nomad + Vault HCL for the factory's single-node cluster. These files are
the source of truth that `lib/init/nomad/cluster-up.sh` copies onto a
factory box under `/etc/nomad.d/` and `/etc/vault.d/` at init time.
This directory covers the **Nomad+Vault migration (Steps 0–2)**;
see issues #821–#884 for the step breakdown.
## What lives here
| File | Deployed to | Owns |
|---|---|---|
| `server.hcl` | `/etc/nomad.d/server.hcl` | agent role, bind, ports, `data_dir` (S0.2) |
| `client.hcl` | `/etc/nomad.d/client.hcl` | Docker driver cfg + `host_volume` declarations (S0.2) |
| `vault.hcl` | `/etc/vault.d/vault.hcl` | Vault storage, listener, UI, `disable_mlock` (S0.3) |
| `jobs/forgejo.hcl` | submitted via `lib/init/nomad/deploy.sh` | Forgejo job; reads creds from Vault via consul-template stanza (S2.4) |
Nomad auto-merges every `*.hcl` under `-config=/etc/nomad.d/`, so the
split between `server.hcl` and `client.hcl` is for readability, not
semantics. The top-of-file header in each config documents which blocks
it owns.
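The `jobs/forgejo.hcl` row above mentions credentials flowing from Vault
via a consul-template stanza. A minimal sketch of that shape, assuming
illustrative secret paths, env-var names, and Vault wiring (the committed
jobspec is the source of truth):

```hcl
job "forgejo" {
  group "forgejo" {
    task "forgejo" {
      driver = "docker"

      # Vault wiring (role/policy binding per S2.1/S2.3) elided here
      vault {}

      # consul-template stanza: rendered at runtime, never committed
      template {
        destination = "secrets/forgejo.env"
        env         = true
        data        = <<-EOT
          {{ with secret "secret/data/forgejo" }}
          FORGEJO__database__PASSWD={{ .Data.data.db_password }}
          {{ end }}
        EOT
      }
    }
  }
}
```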
## Vault ACL policies
`vault/policies/` holds one `.hcl` file per Vault policy; see
[`vault/policies/AGENTS.md`](../vault/policies/AGENTS.md) for the naming
convention, KV path summary, and JWT-auth role bindings (S2.1/S2.3).
## Not yet implemented
- **Additional jobspecs** (woodpecker, agents, caddy) — Step 1 brought up
Forgejo; remaining services land in later steps.
- **TLS, ACLs, gossip encryption** — deliberately absent for now; land
alongside multi-node support.
## Adding a jobspec (Step 1 and later)
1. Drop a file in `nomad/jobs/<service>.hcl`. The `.hcl` suffix is
load-bearing: `.woodpecker/nomad-validate.yml` globs on exactly that
suffix to auto-pick up new jobspecs (see step 2 in "How CI validates
these files" below). Anything else in `nomad/jobs/` is silently
skipped by CI.
2. If it needs persistent state, reference a `host_volume` already
declared in `client.hcl`; *don't* add ad-hoc host paths in the
jobspec. If a new volume is needed, add it to **both**:
- `nomad/client.hcl` — the `host_volume "<name>" { path = … }` block
- `lib/init/nomad/cluster-up.sh` — the `HOST_VOLUME_DIRS` array
The two must stay in sync or Nomad fingerprinting will fail and the
node stays in "initializing". Note that offline `nomad job validate`
will NOT catch a typo in the jobspec's `source = "..."` against the
client.hcl host_volume list (see step 2 below) — the scheduler
rejects the mismatch at placement time instead. A sketch of the
paired declaration follows this list.
3. Pin image tags — `image = "forgejo/forgejo:1.22.5"`, not `:latest`.
4. No pipeline edit required — step 2 of `nomad-validate.yml` globs
over `nomad/jobs/*.hcl` and validates every match. Just make sure
the existing `nomad/**` trigger path still covers your file (it
does for anything under `nomad/jobs/`).
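A sketch of the paired declaration from step 2, using the `forgejo-data`
volume name that appears in the CI notes below; the host path is
illustrative:

```hcl
# nomad/client.hcl
client {
  host_volume "forgejo-data" {
    path      = "/var/lib/forgejo" # illustrative; must exist on the box
    read_only = false
  }
}
```

```sh
# lib/init/nomad/cluster-up.sh — the matching entry in the
# HOST_VOLUME_DIRS array (shape illustrative; the real script is
# authoritative). The directory must be created before Nomad
# fingerprints the client.
HOST_VOLUME_DIRS=(
  "/var/lib/forgejo"
)
```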
## How CI validates these files
`.woodpecker/nomad-validate.yml` runs on every PR that touches `nomad/`
(including `nomad/jobs/`), `lib/init/nomad/`, `bin/disinto`,
`vault/policies/`, or `vault/roles.yaml`. Eight fail-closed steps:
1. **`nomad config validate nomad/server.hcl nomad/client.hcl`**
— parses the HCL, fails on unknown blocks, bad port ranges, invalid
driver config. Vault HCL is excluded (different tool). Jobspecs are
excluded too — agent-config and jobspec are disjoint HCL grammars;
running this step on a jobspec rejects it with "unknown block 'job'".
2. **`nomad job validate nomad/jobs/*.hcl`** (loop, one call per file)
— parses each jobspec's HCL, fails on unknown stanzas, missing
required fields, wrong value types, invalid driver config. Runs
offline (no Nomad server needed) so CI exit 0 ≠ "this will schedule
successfully"; it means "the HCL itself is well-formed". What this
step does NOT catch:
- cross-file references (`source = "forgejo-data"` typo against the
`host_volume` list in `client.hcl`) — that's a scheduling-time
check on the live cluster, not validate-time.
- image reachability — `image = "codeberg.org/forgejo/forgejo:11.0"`
is accepted even if the registry is down or the tag is wrong.
New jobspecs are picked up automatically by the glob — no pipeline
edit needed as long as the file is named `<name>.hcl`. A sketch of
the loop follows this list.
3. **`vault operator diagnose -config=nomad/vault.hcl -skip=storage -skip=listener`**
— Vault's equivalent syntax + schema check. `-skip=storage/listener`
disables the runtime checks (CI containers don't have
`/var/lib/vault/data` or port 8200). Exit 2 (advisory warnings only,
e.g. TLS-disabled listener) is tolerated; exit 1 blocks merge.
4. **`vault policy fmt` idempotence check on every `vault/policies/*.hcl`**
(S2.6) — `vault policy fmt` has no `-check` flag in 1.18.5, so the
step copies each file to `/tmp`, runs `vault policy fmt` on the copy,
and diffs against the original. Any non-empty diff means the
committed file would be rewritten by `fmt` and the step fails — the
author is pointed at `vault policy fmt <file>` to heal the drift
(sketched after this list).
5. **`vault policy write`-based validation against an inline dev-mode Vault**
(S2.6) — Vault 1.18.5 has no offline `policy validate` subcommand;
the CI step starts a dev-mode server, loops `vault policy write
<basename> <file>` over each `vault/policies/*.hcl`, and aggregates
failures so one CI run surfaces every broken policy. The server is
ephemeral and torn down on step exit — no persistence, no real
secrets. Catches unknown capability names (e.g. `"frobnicate"`),
malformed `path` blocks, and other semantic errors `fmt` does not
(sketched after this list).
6. **`vault/roles.yaml` validator** (S2.6) — yamllint + a PyYAML-based
check that every role's `policy:` field matches a basename under
`vault/policies/`, and that every role entry carries all four
required fields (`name`, `policy`, `namespace`, `job_id`). Drift
between the two directories is a scheduling-time "permission denied"
in production; this step turns it into a CI failure at PR time (an
example entry follows this list).
7. **`shellcheck --severity=warning lib/init/nomad/*.sh bin/disinto`**
— all init/dispatcher shell must pass clean. `bin/disinto` has no `.sh`
extension so the repo-wide shellcheck in `.woodpecker/ci.yml` skips
it — this is the one place it gets checked.
8. **`bats tests/disinto-init-nomad.bats`**
— exercises the dispatcher: `disinto init --backend=nomad --dry-run`,
`… --empty --dry-run`, and the `--backend=docker` regression guard.
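For reference, a hedged reconstruction of step 2's loop (the committed
step in `.woodpecker/nomad-validate.yml` is authoritative):

```sh
set -e                      # first failing jobspec aborts the step
for f in nomad/jobs/*.hcl; do
  [ -f "$f" ] || continue   # POSIX sh has no nullglob; skip the literal glob if jobs/ is empty
  echo "validating $f"      # so CI logs point at the failing jobspec
  nomad job validate "$f"
done
```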
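Step 4's idempotence check, sketched under the same caveat:

```sh
set -e
for f in vault/policies/*.hcl; do
  [ -f "$f" ] || continue
  cp "$f" /tmp/policy.hcl
  vault policy fmt /tmp/policy.hcl   # rewrites the copy in place
  diff -u "$f" /tmp/policy.hcl       # non-empty diff exits 1 and fails the step
done
```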
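Step 5's dev-mode validator, sketched; the readiness wait and token
handling here are simplifications:

```sh
# ephemeral, in-memory server; torn down when the CI step exits
vault server -dev -dev-root-token-id=root &
export VAULT_ADDR="http://127.0.0.1:8200" VAULT_TOKEN="root"
sleep 2                              # crude readiness wait
fail=0
for f in vault/policies/*.hcl; do
  [ -f "$f" ] || continue
  # keep going on failure so one run surfaces every broken policy
  vault policy write "$(basename "$f" .hcl)" "$f" || fail=1
done
exit "$fail"
```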
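Finally, the shape step 6 expects of each `vault/roles.yaml` entry.
Whether entries sit in a top-level YAML list is an assumption, and all
values are illustrative:

```yaml
- name: forgejo          # illustrative values throughout
  policy: forgejo        # must match vault/policies/forgejo.hcl
  namespace: default
  job_id: forgejo
```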
**Secret-scan coverage.** Policy HCL files under `vault/policies/` are
already swept by the P11 secret-scan gate
(`.woodpecker/secret-scan.yml`, #798), whose `vault/**/*` trigger path
covers everything in this directory. `nomad-validate.yml` intentionally
does NOT duplicate that gate — one scanner, one source of truth.
If a PR breaks `nomad/server.hcl` (e.g. typo in a block name), step 1
fails with a clear error; if it breaks a jobspec (e.g. misspells
`task` as `tsak`, or adds a `volume` stanza without a `source`), step
2 fails; a typo in a `path "..."` block in a vault policy fails step 5
with the Vault parser's error; a `roles.yaml` entry that points at a
policy basename that does not exist fails step 6. PRs that don't touch
any of the trigger paths skip this pipeline entirely.
## Version pinning
Nomad + Vault versions are pinned in **two** places — bumping one
without the other is a CI-caught drift:
- `lib/init/nomad/install.sh` — the apt-installed versions on factory
boxes (`NOMAD_VERSION`, `VAULT_VERSION`).
- `.woodpecker/nomad-validate.yml` — the `hashicorp/nomad:…` and
`hashicorp/vault:…` image tags used for static validation.
Bump both in the same PR. The CI pipeline will fail if the pinned
image's `config validate` rejects syntax the installed runtime would
accept (or vice versa).
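Concretely, the invariant looks like this; the Nomad version string is
illustrative, and 1.18.5 is the Vault version the CI notes above cite:

```sh
# lib/init/nomad/install.sh
NOMAD_VERSION="1.9.3"    # illustrative
VAULT_VERSION="1.18.5"
```

```yaml
# .woodpecker/nomad-validate.yml (illustrative tag)
image: hashicorp/vault:1.18.5   # keep in lockstep with VAULT_VERSION
```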
## Related
- `lib/init/nomad/` — installer + systemd units + cluster-up orchestrator.
- `.woodpecker/nomad-validate.yml` — this directory's CI pipeline.
- `vault/policies/` — Vault ACL policy HCL files (S2.1); the
`vault-policy-fmt` / `vault-policy-validate` CI steps above enforce
their shape. See [`../vault/policies/AGENTS.md`](../vault/policies/AGENTS.md)
for the policy lifecycle, CI enforcement details, and common failure
modes.
- `vault/roles.yaml` — JWT-auth role → policy bindings (S2.3); the
`vault-roles-validate` CI step above keeps it in lockstep with the
policies directory.
- Top-of-file headers in `server.hcl` / `client.hcl` / `vault.hcl`
document the per-file ownership contract.