From 6e73c6dd1f86e576f5ae56071a64ff81a32595ab Mon Sep 17 00:00:00 2001 From: Claude Date: Thu, 16 Apr 2026 18:15:03 +0000 Subject: [PATCH] =?UTF-8?q?fix:=20[nomad-step-2]=20S2.6=20=E2=80=94=20CI:?= =?UTF-8?q?=20vault=20policy=20fmt=20+=20validate=20+=20roles.yaml=20check?= =?UTF-8?q?=20(#884)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Extend .woodpecker/nomad-validate.yml with three new fail-closed steps that guard every artifact under vault/policies/ and vault/roles.yaml before it can land: 4. vault-policy-fmt — cp+fmt+diff idempotence check (vault 1.18.5 has no `policy fmt -check` flag, so we build the non-destructive check out of `vault policy fmt` on a /tmp copy + diff against the original) 5. vault-policy-validate — HCL syntax + capability validation via `vault policy write` against an inline dev-mode Vault server (no offline `policy validate` subcommand exists; dev-mode writes are ephemeral so this is a validator, not a deploy) 6. vault-roles-validate — yamllint + PyYAML-based role→policy reference check (every role's `policy:` field must match a vault/policies/*.hcl basename; also checks the four required fields name/policy/namespace/job_id) Secret-scan coverage for vault/policies/*.hcl is already provided by the P11 gate (.woodpecker/secret-scan.yml) via its `vault/**/*` trigger path — this pipeline intentionally does NOT duplicate that gate to avoid the inline-heredoc / YAML-parse failure mode that sank the prior attempt at this issue (PR #896). Trigger paths extended: `vault/policies/**` and `vault/roles.yaml`. `lib/init/nomad/vault-*.sh` is already covered by the existing `lib/init/nomad/**` glob. Docs: nomad/AGENTS.md and vault/policies/AGENTS.md updated with the policy lifecycle, the CI enforcement table, and the common failure modes authors will see. Co-Authored-By: Claude Opus 4.6 (1M context) --- .woodpecker/nomad-validate.yml | 208 +++++++++++++++++++++++++++++++-- nomad/AGENTS.md | 48 +++++++- vault/policies/AGENTS.md | 64 +++++++++- 3 files changed, 300 insertions(+), 20 deletions(-) diff --git a/.woodpecker/nomad-validate.yml b/.woodpecker/nomad-validate.yml index 81e45ae..5a1cc7c 100644 --- a/.woodpecker/nomad-validate.yml +++ b/.woodpecker/nomad-validate.yml @@ -1,16 +1,21 @@ # ============================================================================= # .woodpecker/nomad-validate.yml — Static validation for Nomad+Vault artifacts # -# Part of the Nomad+Vault migration (S0.5, issue #825). Locks in the -# "no-ad-hoc-steps" principle: every HCL/shell artifact under nomad/ or -# lib/init/nomad/, plus the `disinto init` dispatcher, gets checked -# before it can land. +# Part of the Nomad+Vault migration (S0.5, issue #825; extended in S2.6, +# issue #884). Locks in the "no-ad-hoc-steps" principle: every HCL/shell +# artifact under nomad/, lib/init/nomad/, vault/policies/, plus the +# `disinto init` dispatcher and vault/roles.yaml, gets checked before it +# can land. # # Triggers on PRs (and pushes) that touch any of: # nomad/** — HCL configs (server, client, vault) -# lib/init/nomad/** — cluster-up / install / systemd / vault-init +# lib/init/nomad/** — cluster-up / install / systemd / vault-init / +# vault-nomad-auth (S2.6 trigger: vault-*.sh +# is a subset of this glob) # bin/disinto — `disinto init --backend=nomad` dispatcher # tests/disinto-init-nomad.bats — the bats suite itself +# vault/policies/** — Vault ACL policy HCL files (S2.1, S2.6) +# vault/roles.yaml — JWT-auth role bindings (S2.3, S2.6) # .woodpecker/nomad-validate.yml — the pipeline definition # # Steps (all fail-closed — any error blocks merge): @@ -19,8 +24,22 @@ # nomad/jobs/*.hcl (new jobspecs get # CI coverage automatically) # 3. vault-operator-diagnose — `vault operator diagnose` syntax check on vault.hcl -# 4. shellcheck-nomad — shellcheck the cluster-up + install scripts + disinto -# 5. bats-init-nomad — `disinto init --backend=nomad --dry-run` smoke tests +# 4. vault-policy-fmt — `vault policy fmt` idempotence check on +# every vault/policies/*.hcl (format drift = +# CI fail; non-destructive via cp+diff) +# 5. vault-policy-validate — HCL syntax + capability validation for every +# vault/policies/*.hcl via `vault policy write` +# against an inline dev-mode Vault server +# 6. vault-roles-validate — yamllint + role→policy reference check on +# vault/roles.yaml (every referenced policy +# must exist as vault/policies/.hcl) +# 7. shellcheck-nomad — shellcheck the cluster-up + install scripts + disinto +# 8. bats-init-nomad — `disinto init --backend=nomad --dry-run` smoke tests +# +# Secret-scan coverage: vault/policies/*.hcl is already scanned by the +# P11 gate (.woodpecker/secret-scan.yml, issue #798) — its trigger path +# `vault/**/*` covers everything under this directory. We intentionally +# do NOT duplicate that gate here; one scanner, one source of truth. # # Pinned image versions match lib/init/nomad/install.sh (nomad 1.9.5 / # vault 1.18.5). Bump there AND here together — drift = CI passing on @@ -34,6 +53,8 @@ when: - "lib/init/nomad/**" - "bin/disinto" - "tests/disinto-init-nomad.bats" + - "vault/policies/**" + - "vault/roles.yaml" - ".woodpecker/nomad-validate.yml" # Authenticated clone — same pattern as .woodpecker/ci.yml. Forgejo is @@ -123,7 +144,176 @@ steps: *) echo "vault config: hard failure (rc=$rc)" >&2; exit "$rc" ;; esac - # ── 4. Shellcheck ──────────────────────────────────────────────────────── + # ── 4. Vault policy fmt idempotence check ──────────────────────────────── + # `vault policy fmt ` formats a local HCL policy file in place. + # There's no `-check`/dry-run flag (vault 1.18.5), so we implement a + # non-destructive check as cp → fmt-on-copy → diff against original. + # Any diff means the committed file would be rewritten by `vault policy + # fmt` — failure steers the author to run `vault policy fmt ` + # locally before pushing. + # + # Scope: vault/policies/*.hcl only. The `[ -f "$f" ]` guard handles the + # no-match case (POSIX sh does not nullglob) so an empty policies/ + # directory does not fail this step. + # + # Note: `vault policy fmt` is purely local (HCL text transform) and does + # not require a running Vault server, which is why this step can run + # without starting one. + - name: vault-policy-fmt + image: hashicorp/vault:1.18.5 + commands: + - | + set -e + failed=0 + for f in vault/policies/*.hcl; do + [ -f "$f" ] || continue + tmp="/tmp/$(basename "$f").fmt" + cp "$f" "$tmp" + vault policy fmt "$tmp" >/dev/null 2>&1 + if ! diff -u "$f" "$tmp"; then + echo "ERROR: $f is not formatted — run 'vault policy fmt $f' locally" >&2 + failed=1 + fi + done + if [ "$failed" -gt 0 ]; then + echo "vault-policy-fmt: formatting drift detected" >&2 + exit 1 + fi + echo "vault-policy-fmt: all policies formatted correctly" + + # ── 5. Vault policy HCL syntax + capability validation ─────────────────── + # Vault has no offline `vault policy validate` subcommand — the closest + # in-CLI validator is `vault policy write`, which sends the HCL to a + # running server which parses it, checks capability names against the + # known set (read, list, create, update, delete, patch, sudo, deny), + # and rejects unknown stanzas / malformed path blocks. We start an + # inline dev-mode Vault (in-memory, no persistence, root token = "root") + # for the duration of this step and loop `vault policy write` over every + # vault/policies/*.hcl; the policies never leave the ephemeral dev + # server, so this is strictly a validator — not a deploy. + # + # Exit-code handling: + # - `vault policy write` exits 0 on success, non-zero on any parse / + # semantic error. We aggregate failures across all files so a single + # CI run surfaces every broken policy (not just the first). + # - The dev server is killed on any step exit via EXIT trap so the + # step tears down cleanly even on failure. + # + # Why dev-mode is sufficient: we're not persisting secrets, only asking + # Vault to parse policy text. The factory's production Vault is NOT + # contacted. + - name: vault-policy-validate + image: hashicorp/vault:1.18.5 + commands: + - | + set -e + vault server -dev -dev-root-token-id=root -dev-listen-address=127.0.0.1:8200 >/tmp/vault-dev.log 2>&1 & + VAULT_PID=$! + trap 'kill "$VAULT_PID" 2>/dev/null || true' EXIT INT TERM + export VAULT_ADDR=http://127.0.0.1:8200 + export VAULT_TOKEN=root + ready=0 + i=0 + while [ "$i" -lt 30 ]; do + if vault status >/dev/null 2>&1; then + ready=1 + break + fi + i=$((i + 1)) + sleep 0.5 + done + if [ "$ready" -ne 1 ]; then + echo "vault-policy-validate: dev server failed to start after 15s" >&2 + cat /tmp/vault-dev.log >&2 || true + exit 1 + fi + failed=0 + for f in vault/policies/*.hcl; do + [ -f "$f" ] || continue + name=$(basename "$f" .hcl) + echo "validate: $f" + if ! vault policy write "$name" "$f"; then + echo " ERROR: $f failed validation" >&2 + failed=1 + fi + done + if [ "$failed" -gt 0 ]; then + echo "vault-policy-validate: validation errors found" >&2 + exit 1 + fi + echo "vault-policy-validate: all policies valid" + + # ── 6. vault/roles.yaml validator ──────────────────────────────────────── + # Validates the JWT-auth role bindings file (S2.3). Two checks: + # + # a. `yamllint` — catches YAML syntax errors and indentation drift. + # Uses a relaxed config (line length bumped to 200) because + # roles.yaml's comments are wide by design. + # b. role → policy reference check — every role's `policy:` field + # must match a basename in vault/policies/*.hcl. A role pointing + # at a non-existent policy = runtime "permission denied" at job + # placement; catching the drift here turns it into a CI failure. + # Also verifies each role entry has the four required fields + # (name, policy, namespace, job_id) per the file's documented + # format. + # + # Parsing is done with PyYAML (the roles.yaml format is a strict + # subset that awk-level parsing in tools/vault-apply-roles.sh handles + # too, but PyYAML in CI gives us structural validation for free). If + # roles.yaml is ever absent (e.g. reverted), the step skips rather + # than fails — presence is enforced by S2.3's own tooling, not here. + - name: vault-roles-validate + image: python:3.12-alpine + commands: + - pip install --quiet --disable-pip-version-check pyyaml yamllint + - | + set -e + if [ ! -f vault/roles.yaml ]; then + echo "vault-roles-validate: vault/roles.yaml not present, skipping" + exit 0 + fi + yamllint -d '{extends: relaxed, rules: {line-length: {max: 200}}}' vault/roles.yaml + echo "vault-roles-validate: yamllint OK" + python3 - <<'PY' + import os + import sys + import yaml + + with open('vault/roles.yaml') as f: + data = yaml.safe_load(f) or {} + roles = data.get('roles') or [] + if not roles: + print("vault-roles-validate: no roles defined in vault/roles.yaml", file=sys.stderr) + sys.exit(1) + existing = { + os.path.splitext(e)[0] + for e in os.listdir('vault/policies') + if e.endswith('.hcl') + } + required = ('name', 'policy', 'namespace', 'job_id') + failed = 0 + for r in roles: + if not isinstance(r, dict): + print(f"ERROR: role entry is not a mapping: {r!r}", file=sys.stderr) + failed = 1 + continue + for field in required: + if r.get(field) in (None, ''): + print(f"ERROR: role entry missing required field '{field}': {r}", file=sys.stderr) + failed = 1 + policy = r.get('policy') + if policy and policy not in existing: + print( + f"ERROR: role '{r.get('name')}' references policy '{policy}' " + f"but vault/policies/{policy}.hcl does not exist", + file=sys.stderr, + ) + failed = 1 + sys.exit(failed) + PY + echo "vault-roles-validate: all role→policy references valid" + + # ── 7. Shellcheck ──────────────────────────────────────────────────────── # Covers the new lib/init/nomad/*.sh scripts plus bin/disinto (which owns # the backend dispatcher). bin/disinto has no .sh extension so the # repo-wide shellcheck in .woodpecker/ci.yml skips it — this step is the @@ -133,7 +323,7 @@ steps: commands: - shellcheck --severity=warning lib/init/nomad/*.sh bin/disinto - # ── 5. bats: `disinto init --backend=nomad --dry-run` ──────────────────── + # ── 8. bats: `disinto init --backend=nomad --dry-run` ──────────────────── # Smoke-tests the CLI dispatcher: both --backend=nomad variants exit 0 # with the expected step list, and --backend=docker stays on the docker # path (regression guard). Pure dry-run — no sudo, no network. diff --git a/nomad/AGENTS.md b/nomad/AGENTS.md index 953a7b2..5be8336 100644 --- a/nomad/AGENTS.md +++ b/nomad/AGENTS.md @@ -59,8 +59,8 @@ it owns. ## How CI validates these files `.woodpecker/nomad-validate.yml` runs on every PR that touches `nomad/` -(including `nomad/jobs/`), `lib/init/nomad/`, or `bin/disinto`. Five -fail-closed steps: +(including `nomad/jobs/`), `lib/init/nomad/`, `bin/disinto`, +`vault/policies/`, or `vault/roles.yaml`. Eight fail-closed steps: 1. **`nomad config validate nomad/server.hcl nomad/client.hcl`** — parses the HCL, fails on unknown blocks, bad port ranges, invalid @@ -85,19 +85,47 @@ fail-closed steps: disables the runtime checks (CI containers don't have `/var/lib/vault/data` or port 8200). Exit 2 (advisory warnings only, e.g. TLS-disabled listener) is tolerated; exit 1 blocks merge. -4. **`shellcheck --severity=warning lib/init/nomad/*.sh bin/disinto`** +4. **`vault policy fmt` idempotence check on every `vault/policies/*.hcl`** + (S2.6) — `vault policy fmt` has no `-check` flag in 1.18.5, so the + step copies each file to `/tmp`, runs `vault policy fmt` on the copy, + and diffs against the original. Any non-empty diff means the + committed file would be rewritten by `fmt` and the step fails — the + author is pointed at `vault policy fmt ` to heal the drift. +5. **`vault policy write`-based validation against an inline dev-mode Vault** + (S2.6) — Vault 1.18.5 has no offline `policy validate` subcommand; + the CI step starts a dev-mode server, loops `vault policy write + ` over each `vault/policies/*.hcl`, and aggregates + failures so one CI run surfaces every broken policy. The server is + ephemeral and torn down on step exit — no persistence, no real + secrets. Catches unknown capability names (e.g. `"frobnicate"`), + malformed `path` blocks, and other semantic errors `fmt` does not. +6. **`vault/roles.yaml` validator** (S2.6) — yamllint + a PyYAML-based + check that every role's `policy:` field matches a basename under + `vault/policies/`, and that every role entry carries all four + required fields (`name`, `policy`, `namespace`, `job_id`). Drift + between the two directories is a scheduling-time "permission denied" + in production; this step turns it into a CI failure at PR time. +7. **`shellcheck --severity=warning lib/init/nomad/*.sh bin/disinto`** — all init/dispatcher shell clean. `bin/disinto` has no `.sh` extension so the repo-wide shellcheck in `.woodpecker/ci.yml` skips it — this is the one place it gets checked. -5. **`bats tests/disinto-init-nomad.bats`** +8. **`bats tests/disinto-init-nomad.bats`** — exercises the dispatcher: `disinto init --backend=nomad --dry-run`, `… --empty --dry-run`, and the `--backend=docker` regression guard. +**Secret-scan coverage.** Policy HCL files under `vault/policies/` are +already swept by the P11 secret-scan gate +(`.woodpecker/secret-scan.yml`, #798), whose `vault/**/*` trigger path +covers everything in this directory. `nomad-validate.yml` intentionally +does NOT duplicate that gate — one scanner, one source of truth. + If a PR breaks `nomad/server.hcl` (e.g. typo in a block name), step 1 fails with a clear error; if it breaks a jobspec (e.g. misspells `task` as `tsak`, or adds a `volume` stanza without a `source`), step -2 fails instead. The fix makes it pass. PRs that don't touch any of -the trigger paths skip this pipeline entirely. +2 fails; a typo in a `path "..."` block in a vault policy fails step 5 +with the Vault parser's error; a `roles.yaml` entry that points at a +policy basename that does not exist fails step 6. PRs that don't touch +any of the trigger paths skip this pipeline entirely. ## Version pinning @@ -117,5 +145,13 @@ accept (or vice versa). - `lib/init/nomad/` — installer + systemd units + cluster-up orchestrator. - `.woodpecker/nomad-validate.yml` — this directory's CI pipeline. +- `vault/policies/` — Vault ACL policy HCL files (S2.1); the + `vault-policy-fmt` / `vault-policy-validate` CI steps above enforce + their shape. See [`../vault/policies/AGENTS.md`](../vault/policies/AGENTS.md) + for the policy lifecycle, CI enforcement details, and common failure + modes. +- `vault/roles.yaml` — JWT-auth role → policy bindings (S2.3); the + `vault-roles-validate` CI step above keeps it in lockstep with the + policies directory. - Top-of-file headers in `server.hcl` / `client.hcl` / `vault.hcl` document the per-file ownership contract. diff --git a/vault/policies/AGENTS.md b/vault/policies/AGENTS.md index edaf21c..ff1f403 100644 --- a/vault/policies/AGENTS.md +++ b/vault/policies/AGENTS.md @@ -48,12 +48,17 @@ validation. 1. Drop a file matching one of the four naming patterns above. Use an existing file in the same family as the template — comment header, capability list, and KV path layout should match the family. -2. Run `tools/vault-apply-policies.sh --dry-run` to confirm the new +2. Run `vault policy fmt ` locally so the formatting matches what + the CI fmt-check (step 4 of `.woodpecker/nomad-validate.yml`) will + accept. The fmt check runs non-destructively in CI but a dirty file + fails the step; running `fmt` locally before pushing is the fastest + path. +3. Add the matching entry to `../roles.yaml` (see "JWT-auth roles" below) + so the CI role-reference check (step 6) stays green. +4. Run `tools/vault-apply-policies.sh --dry-run` to confirm the new basename appears in the planned-work list with the expected SHA. -3. Run `tools/vault-apply-policies.sh` against a Vault instance to +5. Run `tools/vault-apply-policies.sh` against a Vault instance to create it; re-run to confirm it reports `unchanged`. -4. The CI fmt + validate step lands in S2.6 (#884). Until then - `vault policy fmt ` locally is the fastest sanity check. ## JWT-auth roles (S2.3) @@ -117,6 +122,56 @@ would let one service's tokens outlive the others — add a field to `vault/roles.yaml` and the applier at the same time if that ever becomes necessary. +## Policy lifecycle + +Adding a policy that an actual workload consumes is a three-step chain; +the CI pipeline guards each link. + +1. **Add the policy HCL** — `vault/policies/.hcl`, formatted with + `vault policy fmt`. Capabilities must be drawn from the Vault-recognized + set (`read`, `list`, `create`, `update`, `delete`, `patch`, `sudo`, + `deny`); a typo fails CI step 5 (HCL written to an inline dev-mode Vault + via `vault policy write` — a real parser, not a regex). +2. **Update `../roles.yaml`** — add a JWT-auth role entry whose `policy:` + field matches the new basename (without `.hcl`). CI step 6 re-checks + every role in this file against the policy set, so a drift between the + two directories fails the step. +3. **Reference from a Nomad jobspec** — add `vault { role = "" }` in + `nomad/jobs/.hcl` (owned by S2.4). Policies do not take effect + until a Nomad job asks for a token via that role. + +See the "Adding a new service" walkthrough below for the applier-script +flow once steps 1–3 are committed. + +## CI enforcement (`.woodpecker/nomad-validate.yml`) + +The pipeline triggers on any PR touching `vault/policies/**`, +`vault/roles.yaml`, or `lib/init/nomad/vault-*.sh` and runs four +vault-scoped checks (in addition to the nomad-scoped steps already in +place): + +| Step | Tool | What it catches | +|---|---|---| +| 4. `vault-policy-fmt` | `vault policy fmt` + `diff` | formatting drift — trailing whitespace, wrong indentation, missing newlines | +| 5. `vault-policy-validate` | `vault policy write` against inline dev Vault | HCL syntax errors, unknown stanzas, invalid capability names (e.g. `"frobnicate"`), malformed `path "..." {}` blocks | +| 6. `vault-roles-validate` | yamllint + PyYAML | roles.yaml syntax drift, missing required fields, role→policy references with no matching `.hcl` | +| P11 | `lib/secret-scan.sh` via `.woodpecker/secret-scan.yml` | literal secret leaked into a policy HCL (rare copy-paste mistake) — already covers `vault/**/*`, no duplicate step here | + +All four steps are fail-closed — any error blocks merge. The pipeline +pins `hashicorp/vault:1.18.5` (matching `lib/init/nomad/install.sh`); +bumping the runtime version without bumping the CI image is a CI-caught +drift. + +## Common failure modes + +| Symptom in CI logs | Root cause | Fix | +|---|---|---| +| `vault-policy-fmt: … is not formatted — run 'vault policy fmt '` | Trailing whitespace / mixed indent in an HCL file | `vault policy fmt ` locally and re-commit | +| `vault-policy-validate: … failed validation` plus a `policy` error from Vault | Unknown capability (e.g. `"frobnicate"`), unknown stanza, malformed `path` block | Fix the HCL; valid capabilities are `read`, `list`, `create`, `update`, `delete`, `patch`, `sudo`, `deny` | +| `vault-roles-validate: ERROR: role 'X' references policy 'Y' but vault/policies/Y.hcl does not exist` | A role's `policy:` field does not match any file basename in `vault/policies/` | Either add the missing policy HCL or fix the typo in `roles.yaml` | +| `vault-roles-validate: ERROR: role entry missing required field 'Z'` | A role in `roles.yaml` is missing one of `name`, `policy`, `namespace`, `job_id` | Add the field; all four are required | +| P11 `secret-scan: detected potential secret …` on a `.hcl` file | A literal token/password was pasted into a policy | Policies must name KV paths, not carry secret values — move the literal into KV (S2.2) and have the policy grant `read` on the path | + ## What this directory does NOT own - **Attaching policies to Nomad jobs.** That's S2.4 (#882) via the @@ -124,4 +179,3 @@ becomes necessary. name in `vault { role = "..." }` is what binds the policy. - **Writing the secret values themselves.** That's S2.2 (#880) via `tools/vault-import.sh`. -- **CI policy fmt + validate + roles.yaml check.** That's S2.6 (#884).