Extend .woodpecker/nomad-validate.yml with three new fail-closed steps
that guard every artifact under vault/policies/ and vault/roles.yaml
before it can land:
4. vault-policy-fmt — cp+fmt+diff idempotence check (vault 1.18.5
has no `policy fmt -check` flag, so we
build the non-destructive check out of
`vault policy fmt` on a /tmp copy + diff
against the original)
5. vault-policy-validate — HCL syntax + capability validation via
`vault policy write` against an inline
dev-mode Vault server (no offline
`policy validate` subcommand exists;
dev-mode writes are ephemeral so this is
a validator, not a deploy)
6. vault-roles-validate — yamllint + PyYAML-based role→policy
reference check (every role's `policy:`
field must match a vault/policies/*.hcl
basename; also checks the four required
fields name/policy/namespace/job_id)
Secret-scan coverage for vault/policies/*.hcl is already provided by
the P11 gate (.woodpecker/secret-scan.yml) via its `vault/**/*` trigger
path — this pipeline intentionally does NOT duplicate that gate to
avoid the inline-heredoc / YAML-parse failure mode that sank the prior
attempt at this issue (PR #896).
Trigger paths extended: `vault/policies/**` and `vault/roles.yaml`.
`lib/init/nomad/vault-*.sh` is already covered by the existing
`lib/init/nomad/**` glob.
Docs: nomad/AGENTS.md and vault/policies/AGENTS.md updated with the
policy lifecycle, the CI enforcement table, and the common failure
modes authors will see.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# nomad/ — Agent Instructions

Nomad + Vault HCL for the factory's single-node cluster. These files are
the source of truth that `lib/init/nomad/cluster-up.sh` copies onto a
factory box under `/etc/nomad.d/` and `/etc/vault.d/` at init time.
This directory is part of the Nomad+Vault migration (Step 0) — see issues #821–#825 for the step breakdown. Jobspecs land in Step 1.
## What lives here

| File | Deployed to | Owns |
|---|---|---|
| `server.hcl` | `/etc/nomad.d/server.hcl` | agent role, bind, ports, `data_dir` (S0.2) |
| `client.hcl` | `/etc/nomad.d/client.hcl` | Docker driver cfg + `host_volume` declarations (S0.2) |
| `vault.hcl` | `/etc/vault.d/vault.hcl` | Vault storage, listener, UI, `disable_mlock` (S0.3) |
Nomad auto-merges every `*.hcl` under `-config=/etc/nomad.d/`, so the
split between `server.hcl` and `client.hcl` is for readability, not
semantics. The top-of-file header in each config documents which blocks
it owns.
## What does NOT live here yet

- **Jobspecs.** Step 0 brings up an empty cluster. Step 1 (and later)
  adds `*.hcl` job files for forgejo, woodpecker, agents, caddy, etc.
  When that lands, jobspecs will live in `nomad/jobs/` and each will get
  its own header comment pointing to the `host_volume` names it consumes
  (`volume = "forgejo-data"`, etc. — declared in `client.hcl`).
- **TLS, ACLs, gossip encryption.** Deliberately absent in Step 0 —
  factory traffic stays on localhost. These land in later migration
  steps alongside multi-node support.
## Adding a jobspec (Step 1 and later)

- Drop a file in `nomad/jobs/<service>.hcl`. The `.hcl` suffix is
  load-bearing: `.woodpecker/nomad-validate.yml` globs on exactly that
  suffix to auto-pick up new jobspecs (see step 2 in "How CI validates
  these files" below). Anything else in `nomad/jobs/` is silently
  skipped by CI.
- If it needs persistent state, reference a `host_volume` already
  declared in `client.hcl` — don't add ad-hoc host paths in the
  jobspec. If a new volume is needed, add it to both:
  - `nomad/client.hcl` — the `host_volume "<name>" { path = … }` block
  - `lib/init/nomad/cluster-up.sh` — the `HOST_VOLUME_DIRS` array

  The two must stay in sync or Nomad fingerprinting will fail and the
  node stays in "initializing". Note that offline `nomad job validate`
  will NOT catch a typo in the jobspec's `source = "..."` against the
  `client.hcl` `host_volume` list (see step 2 below) — the scheduler
  rejects the mismatch at placement time instead.
- Pin image tags — `image = "forgejo/forgejo:1.22.5"`, not `:latest`.
- No pipeline edit required — step 2 of `nomad-validate.yml` globs over
  `nomad/jobs/*.hcl` and validates every match. Just make sure the
  existing `nomad/**` trigger path still covers your file (it does for
  anything under `nomad/jobs/`).
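The `host_volume` pairing described above can be sketched in HCL like
this — the `forgejo-data` name comes from this doc, but the `path`,
group, and task details are illustrative, not the repo's actual files:

```hcl
# nomad/client.hcl — declare the volume once per node
client {
  host_volume "forgejo-data" {
    path      = "/srv/factory/forgejo" # illustrative; must also appear in HOST_VOLUME_DIRS
    read_only = false
  }
}

# nomad/jobs/forgejo.hcl (fragment) — consume it by name
group "forgejo" {
  volume "data" {
    type   = "host"
    source = "forgejo-data" # must equal the host_volume label in client.hcl
  }

  task "forgejo" {
    driver = "docker"

    volume_mount {
      volume      = "data"
      destination = "/data"
    }
  }
}
```

A mismatch between `source` and the `host_volume` label is exactly the
drift that offline `nomad job validate` cannot see.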
## How CI validates these files

`.woodpecker/nomad-validate.yml` runs on every PR that touches `nomad/`
(including `nomad/jobs/`), `lib/init/nomad/`, `bin/disinto`,
`vault/policies/`, or `vault/roles.yaml`. Eight fail-closed steps:

1. `nomad config validate nomad/server.hcl nomad/client.hcl` — parses
   the HCL, fails on unknown blocks, bad port ranges, invalid driver
   config. Vault HCL is excluded (different tool). Jobspecs are excluded
   too — agent-config and jobspec are disjoint HCL grammars; running
   this step on a jobspec rejects it with "unknown block 'job'".
2. `nomad job validate nomad/jobs/*.hcl` (loop, one call per file) —
   parses each jobspec's HCL, fails on unknown stanzas, missing required
   fields, wrong value types, invalid driver config. Runs offline (no
   Nomad server needed), so CI exit 0 ≠ "this will schedule
   successfully"; it means "the HCL itself is well-formed". What this
   step does NOT catch:
   - cross-file references (a `source = "forgejo-data"` typo against
     the `host_volume` list in `client.hcl`) — that's a scheduling-time
     check on the live cluster, not validate-time.
   - image reachability — `image = "codeberg.org/forgejo/forgejo:11.0"`
     is accepted even if the registry is down or the tag is wrong.

   New jobspecs are picked up automatically by the glob — no pipeline
   edit needed as long as the file is named `<name>.hcl`.
3. `vault operator diagnose -config=nomad/vault.hcl -skip=storage
   -skip=listener` — Vault's equivalent syntax + schema check.
   `-skip=storage`/`-skip=listener` disable the runtime checks (CI
   containers don't have `/var/lib/vault/data` or port 8200). Exit 2
   (advisory warnings only, e.g. a TLS-disabled listener) is tolerated;
   exit 1 blocks merge.
4. `vault policy fmt` idempotence check on every `vault/policies/*.hcl`
   (S2.6) — `vault policy fmt` has no `-check` flag in 1.18.5, so the
   step copies each file to `/tmp`, runs `vault policy fmt` on the
   copy, and diffs against the original. Any non-empty diff means the
   committed file would be rewritten by `fmt`, and the step fails — the
   author is pointed at `vault policy fmt <file>` to heal the drift.
5. `vault policy write`-based validation against an inline dev-mode
   Vault (S2.6) — Vault 1.18.5 has no offline `policy validate`
   subcommand; the CI step starts a dev-mode server, loops
   `vault policy write <basename> <file>` over each
   `vault/policies/*.hcl`, and aggregates failures so one CI run
   surfaces every broken policy. The server is ephemeral and torn down
   on step exit — no persistence, no real secrets. Catches unknown
   capability names (e.g. `"frobnicate"`), malformed `path` blocks, and
   other semantic errors `fmt` does not.
6. `vault/roles.yaml` validator (S2.6) — yamllint + a PyYAML-based
   check that every role's `policy:` field matches a basename under
   `vault/policies/`, and that every role entry carries all four
   required fields (`name`, `policy`, `namespace`, `job_id`). Drift
   between the two is a scheduling-time "permission denied" in
   production; this step turns it into a CI failure at PR time.
7. `shellcheck --severity=warning lib/init/nomad/*.sh bin/disinto` —
   all init/dispatcher shell clean. `bin/disinto` has no `.sh`
   extension, so the repo-wide shellcheck in `.woodpecker/ci.yml` skips
   it — this is the one place it gets checked.
8. `bats tests/disinto-init-nomad.bats` — exercises the dispatcher:
   `disinto init --backend=nomad --dry-run`, `… --empty --dry-run`, and
   the `--backend=docker` regression guard.
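The cp+fmt+diff idempotence check could look roughly like the Woodpecker
step below. The step name matches the commit message; the exact YAML
shape, image tag form, and `/tmp` filename are assumptions, not the
repo's actual pipeline:

```yaml
vault-policy-fmt:
  image: hashicorp/vault:1.18.5
  commands:
    - |
      fail=0
      for f in vault/policies/*.hcl; do
        cp "$f" /tmp/fmt-check.hcl
        vault policy fmt /tmp/fmt-check.hcl   # rewrites the copy in place
        if ! diff -u "$f" /tmp/fmt-check.hcl; then
          echo "FAIL: run 'vault policy fmt $f' and commit the result"
          fail=1
        fi
      done
      exit "$fail"
```

Because `fmt` only ever touches the `/tmp` copy, the committed file is
never modified — the step is a pure check.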
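Likewise, the dev-mode `vault policy write` validator might be sketched
as below — the root-token value and the sleep-based readiness wait are
illustrative choices, not what the pipeline necessarily does:

```yaml
vault-policy-validate:
  image: hashicorp/vault:1.18.5
  environment:
    VAULT_ADDR: http://127.0.0.1:8200
    VAULT_TOKEN: ci-root
  commands:
    - |
      # Ephemeral in-memory server; dies with the step, persists nothing.
      vault server -dev -dev-root-token-id=ci-root &
      sleep 2
      fail=0
      for f in vault/policies/*.hcl; do
        vault policy write "$(basename "$f" .hcl)" "$f" || fail=1
      done
      exit "$fail"
```

Aggregating with `fail=1` instead of exiting on the first error is what
lets one CI run surface every broken policy at once.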
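The role→policy reference check can be sketched in Python. This is a
minimal sketch, not the repo's actual validator: the function and
variable names are invented here, and in CI the `roles` list would come
from `yaml.safe_load()` on `vault/roles.yaml` while `policy_basenames`
would come from globbing `vault/policies/*.hcl`. Only the four required
field names are taken from the doc:

```python
# Sketch of the vault/roles.yaml cross-reference check (names illustrative).
REQUIRED_FIELDS = {"name", "policy", "namespace", "job_id"}

def check_roles(roles, policy_basenames):
    """Return a list of human-readable errors; an empty list means valid."""
    errors = []
    for i, role in enumerate(roles):
        missing = REQUIRED_FIELDS - role.keys()
        if missing:
            errors.append(f"role #{i}: missing required fields {sorted(missing)}")
            continue
        if role["policy"] not in policy_basenames:
            errors.append(
                f"role {role['name']!r}: policy {role['policy']!r} "
                "has no matching vault/policies/*.hcl"
            )
    return errors
```

Collecting errors instead of raising on the first one mirrors the
aggregate-failures behavior described for step 5.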
**Secret-scan coverage.** Policy HCL files under `vault/policies/` are
already swept by the P11 secret-scan gate
(`.woodpecker/secret-scan.yml`, #798), whose `vault/**/*` trigger path
covers everything in this directory. `nomad-validate.yml` intentionally
does NOT duplicate that gate — one scanner, one source of truth.
If a PR breaks `nomad/server.hcl` (e.g. a typo in a block name), step 1
fails with a clear error; if it breaks a jobspec (e.g. misspells
`task` as `tsak`, or adds a `volume` stanza without a `source`), step
2 fails; a typo in a `path "..."` block in a vault policy fails step 5
with the Vault parser's error; and a `roles.yaml` entry that points at a
policy basename that does not exist fails step 6. PRs that don't touch
any of the trigger paths skip this pipeline entirely.
## Version pinning

Nomad + Vault versions are pinned in two places — bumping one without
the other is a CI-caught drift:

- `lib/init/nomad/install.sh` — the apt-installed versions on factory
  boxes (`NOMAD_VERSION`, `VAULT_VERSION`).
- `.woodpecker/nomad-validate.yml` — the `hashicorp/nomad:…` and
  `hashicorp/vault:…` image tags used for static validation.

Bump both in the same PR. The CI pipeline will fail if the pinned
image's `config validate` rejects syntax the installed runtime would
accept (or vice versa).
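The two pin sites look roughly like this — the Nomad version shown is a
placeholder, and only 1.18.5 for Vault is taken from elsewhere in this
doc:

```
# lib/init/nomad/install.sh (shell) — apt-installed runtime pins
#   NOMAD_VERSION="1.9.0"    # placeholder version
#   VAULT_VERSION="1.18.5"

# .woodpecker/nomad-validate.yml (YAML) — image tags must match the above
#   image: hashicorp/nomad:1.9.0
#   image: hashicorp/vault:1.18.5
```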
## Related

- `lib/init/nomad/` — installer + systemd units + cluster-up
  orchestrator.
- `.woodpecker/nomad-validate.yml` — this directory's CI pipeline.
- `vault/policies/` — Vault ACL policy HCL files (S2.1); the
  `vault-policy-fmt`/`vault-policy-validate` CI steps above enforce
  their shape. See `../vault/policies/AGENTS.md` for the policy
  lifecycle, CI enforcement details, and common failure modes.
- `vault/roles.yaml` — JWT-auth role → policy bindings (S2.3); the
  `vault-roles-validate` CI step above keeps it in lockstep with the
  policies directory.
- Top-of-file headers in `server.hcl`/`client.hcl`/`vault.hcl` document
  the per-file ownership contract.