diff --git a/.woodpecker/nomad-validate.yml b/.woodpecker/nomad-validate.yml index 5a1cc7c..81e45ae 100644 --- a/.woodpecker/nomad-validate.yml +++ b/.woodpecker/nomad-validate.yml @@ -1,21 +1,16 @@ # ============================================================================= # .woodpecker/nomad-validate.yml — Static validation for Nomad+Vault artifacts # -# Part of the Nomad+Vault migration (S0.5, issue #825; extended in S2.6, -# issue #884). Locks in the "no-ad-hoc-steps" principle: every HCL/shell -# artifact under nomad/, lib/init/nomad/, vault/policies/, plus the -# `disinto init` dispatcher and vault/roles.yaml, gets checked before it -# can land. +# Part of the Nomad+Vault migration (S0.5, issue #825). Locks in the +# "no-ad-hoc-steps" principle: every HCL/shell artifact under nomad/ or +# lib/init/nomad/, plus the `disinto init` dispatcher, gets checked +# before it can land. # # Triggers on PRs (and pushes) that touch any of: # nomad/** — HCL configs (server, client, vault) -# lib/init/nomad/** — cluster-up / install / systemd / vault-init / -# vault-nomad-auth (S2.6 trigger: vault-*.sh -# is a subset of this glob) +# lib/init/nomad/** — cluster-up / install / systemd / vault-init # bin/disinto — `disinto init --backend=nomad` dispatcher # tests/disinto-init-nomad.bats — the bats suite itself -# vault/policies/** — Vault ACL policy HCL files (S2.1, S2.6) -# vault/roles.yaml — JWT-auth role bindings (S2.3, S2.6) # .woodpecker/nomad-validate.yml — the pipeline definition # # Steps (all fail-closed — any error blocks merge): @@ -24,22 +19,8 @@ # nomad/jobs/*.hcl (new jobspecs get # CI coverage automatically) # 3. vault-operator-diagnose — `vault operator diagnose` syntax check on vault.hcl -# 4. vault-policy-fmt — `vault policy fmt` idempotence check on -# every vault/policies/*.hcl (format drift = -# CI fail; non-destructive via cp+diff) -# 5. vault-policy-validate — HCL syntax + capability validation for every -# vault/policies/*.hcl via `vault policy write` -# against an inline dev-mode Vault server -# 6. vault-roles-validate — yamllint + role→policy reference check on -# vault/roles.yaml (every referenced policy -# must exist as vault/policies/.hcl) -# 7. shellcheck-nomad — shellcheck the cluster-up + install scripts + disinto -# 8. bats-init-nomad — `disinto init --backend=nomad --dry-run` smoke tests -# -# Secret-scan coverage: vault/policies/*.hcl is already scanned by the -# P11 gate (.woodpecker/secret-scan.yml, issue #798) — its trigger path -# `vault/**/*` covers everything under this directory. We intentionally -# do NOT duplicate that gate here; one scanner, one source of truth. +# 4. shellcheck-nomad — shellcheck the cluster-up + install scripts + disinto +# 5. bats-init-nomad — `disinto init --backend=nomad --dry-run` smoke tests # # Pinned image versions match lib/init/nomad/install.sh (nomad 1.9.5 / # vault 1.18.5). Bump there AND here together — drift = CI passing on @@ -53,8 +34,6 @@ when: - "lib/init/nomad/**" - "bin/disinto" - "tests/disinto-init-nomad.bats" - - "vault/policies/**" - - "vault/roles.yaml" - ".woodpecker/nomad-validate.yml" # Authenticated clone — same pattern as .woodpecker/ci.yml. Forgejo is @@ -144,176 +123,7 @@ steps: *) echo "vault config: hard failure (rc=$rc)" >&2; exit "$rc" ;; esac - # ── 4. Vault policy fmt idempotence check ──────────────────────────────── - # `vault policy fmt ` formats a local HCL policy file in place. - # There's no `-check`/dry-run flag (vault 1.18.5), so we implement a - # non-destructive check as cp → fmt-on-copy → diff against original. - # Any diff means the committed file would be rewritten by `vault policy - # fmt` — failure steers the author to run `vault policy fmt ` - # locally before pushing. - # - # Scope: vault/policies/*.hcl only. The `[ -f "$f" ]` guard handles the - # no-match case (POSIX sh does not nullglob) so an empty policies/ - # directory does not fail this step. - # - # Note: `vault policy fmt` is purely local (HCL text transform) and does - # not require a running Vault server, which is why this step can run - # without starting one. - - name: vault-policy-fmt - image: hashicorp/vault:1.18.5 - commands: - - | - set -e - failed=0 - for f in vault/policies/*.hcl; do - [ -f "$f" ] || continue - tmp="/tmp/$(basename "$f").fmt" - cp "$f" "$tmp" - vault policy fmt "$tmp" >/dev/null 2>&1 - if ! diff -u "$f" "$tmp"; then - echo "ERROR: $f is not formatted — run 'vault policy fmt $f' locally" >&2 - failed=1 - fi - done - if [ "$failed" -gt 0 ]; then - echo "vault-policy-fmt: formatting drift detected" >&2 - exit 1 - fi - echo "vault-policy-fmt: all policies formatted correctly" - - # ── 5. Vault policy HCL syntax + capability validation ─────────────────── - # Vault has no offline `vault policy validate` subcommand — the closest - # in-CLI validator is `vault policy write`, which sends the HCL to a - # running server which parses it, checks capability names against the - # known set (read, list, create, update, delete, patch, sudo, deny), - # and rejects unknown stanzas / malformed path blocks. We start an - # inline dev-mode Vault (in-memory, no persistence, root token = "root") - # for the duration of this step and loop `vault policy write` over every - # vault/policies/*.hcl; the policies never leave the ephemeral dev - # server, so this is strictly a validator — not a deploy. - # - # Exit-code handling: - # - `vault policy write` exits 0 on success, non-zero on any parse / - # semantic error. We aggregate failures across all files so a single - # CI run surfaces every broken policy (not just the first). - # - The dev server is killed on any step exit via EXIT trap so the - # step tears down cleanly even on failure. - # - # Why dev-mode is sufficient: we're not persisting secrets, only asking - # Vault to parse policy text. The factory's production Vault is NOT - # contacted. - - name: vault-policy-validate - image: hashicorp/vault:1.18.5 - commands: - - | - set -e - vault server -dev -dev-root-token-id=root -dev-listen-address=127.0.0.1:8200 >/tmp/vault-dev.log 2>&1 & - VAULT_PID=$! - trap 'kill "$VAULT_PID" 2>/dev/null || true' EXIT INT TERM - export VAULT_ADDR=http://127.0.0.1:8200 - export VAULT_TOKEN=root - ready=0 - i=0 - while [ "$i" -lt 30 ]; do - if vault status >/dev/null 2>&1; then - ready=1 - break - fi - i=$((i + 1)) - sleep 0.5 - done - if [ "$ready" -ne 1 ]; then - echo "vault-policy-validate: dev server failed to start after 15s" >&2 - cat /tmp/vault-dev.log >&2 || true - exit 1 - fi - failed=0 - for f in vault/policies/*.hcl; do - [ -f "$f" ] || continue - name=$(basename "$f" .hcl) - echo "validate: $f" - if ! vault policy write "$name" "$f"; then - echo " ERROR: $f failed validation" >&2 - failed=1 - fi - done - if [ "$failed" -gt 0 ]; then - echo "vault-policy-validate: validation errors found" >&2 - exit 1 - fi - echo "vault-policy-validate: all policies valid" - - # ── 6. vault/roles.yaml validator ──────────────────────────────────────── - # Validates the JWT-auth role bindings file (S2.3). Two checks: - # - # a. `yamllint` — catches YAML syntax errors and indentation drift. - # Uses a relaxed config (line length bumped to 200) because - # roles.yaml's comments are wide by design. - # b. role → policy reference check — every role's `policy:` field - # must match a basename in vault/policies/*.hcl. A role pointing - # at a non-existent policy = runtime "permission denied" at job - # placement; catching the drift here turns it into a CI failure. - # Also verifies each role entry has the four required fields - # (name, policy, namespace, job_id) per the file's documented - # format. - # - # Parsing is done with PyYAML (the roles.yaml format is a strict - # subset that awk-level parsing in tools/vault-apply-roles.sh handles - # too, but PyYAML in CI gives us structural validation for free). If - # roles.yaml is ever absent (e.g. reverted), the step skips rather - # than fails — presence is enforced by S2.3's own tooling, not here. - - name: vault-roles-validate - image: python:3.12-alpine - commands: - - pip install --quiet --disable-pip-version-check pyyaml yamllint - - | - set -e - if [ ! -f vault/roles.yaml ]; then - echo "vault-roles-validate: vault/roles.yaml not present, skipping" - exit 0 - fi - yamllint -d '{extends: relaxed, rules: {line-length: {max: 200}}}' vault/roles.yaml - echo "vault-roles-validate: yamllint OK" - python3 - <<'PY' - import os - import sys - import yaml - - with open('vault/roles.yaml') as f: - data = yaml.safe_load(f) or {} - roles = data.get('roles') or [] - if not roles: - print("vault-roles-validate: no roles defined in vault/roles.yaml", file=sys.stderr) - sys.exit(1) - existing = { - os.path.splitext(e)[0] - for e in os.listdir('vault/policies') - if e.endswith('.hcl') - } - required = ('name', 'policy', 'namespace', 'job_id') - failed = 0 - for r in roles: - if not isinstance(r, dict): - print(f"ERROR: role entry is not a mapping: {r!r}", file=sys.stderr) - failed = 1 - continue - for field in required: - if r.get(field) in (None, ''): - print(f"ERROR: role entry missing required field '{field}': {r}", file=sys.stderr) - failed = 1 - policy = r.get('policy') - if policy and policy not in existing: - print( - f"ERROR: role '{r.get('name')}' references policy '{policy}' " - f"but vault/policies/{policy}.hcl does not exist", - file=sys.stderr, - ) - failed = 1 - sys.exit(failed) - PY - echo "vault-roles-validate: all role→policy references valid" - - # ── 7. Shellcheck ──────────────────────────────────────────────────────── + # ── 4. Shellcheck ──────────────────────────────────────────────────────── # Covers the new lib/init/nomad/*.sh scripts plus bin/disinto (which owns # the backend dispatcher). bin/disinto has no .sh extension so the # repo-wide shellcheck in .woodpecker/ci.yml skips it — this step is the @@ -323,7 +133,7 @@ steps: commands: - shellcheck --severity=warning lib/init/nomad/*.sh bin/disinto - # ── 8. bats: `disinto init --backend=nomad --dry-run` ──────────────────── + # ── 5. bats: `disinto init --backend=nomad --dry-run` ──────────────────── # Smoke-tests the CLI dispatcher: both --backend=nomad variants exit 0 # with the expected step list, and --backend=docker stays on the docker # path (regression guard). Pure dry-run — no sudo, no network. diff --git a/lib/action-vault.sh b/lib/action-vault.sh index 7bed943..8826aa3 100644 --- a/lib/action-vault.sh +++ b/lib/action-vault.sh @@ -135,6 +135,7 @@ vault_request() { local vault_env="${FACTORY_ROOT:-$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)}/action-vault/vault-env.sh" if [ ! -f "$vault_env" ]; then echo "ERROR: vault-env.sh not found at $vault_env" >&2 + rm -f "$tmp_toml" return 1 fi @@ -144,13 +145,14 @@ vault_request() { if ! source "$vault_env"; then FORGE_TOKEN="${_saved_forge_token:-}" echo "ERROR: failed to source vault-env.sh" >&2 + rm -f "$tmp_toml" return 1 fi # Restore caller's FORGE_TOKEN after validation FORGE_TOKEN="${_saved_forge_token:-}" # Set trap AFTER sourcing vault-env.sh to avoid RETURN trap firing during source - trap 'rm -f "$tmp_toml"' RETURN + trap 'rm -f "$tmp_toml"' EXIT # Run validation if ! validate_vault_action "$tmp_toml"; then diff --git a/nomad/AGENTS.md b/nomad/AGENTS.md index 0ce3cea..25695f8 100644 --- a/nomad/AGENTS.md +++ b/nomad/AGENTS.md @@ -61,8 +61,8 @@ convention, KV path summary, and JWT-auth role bindings (S2.1/S2.3). ## How CI validates these files `.woodpecker/nomad-validate.yml` runs on every PR that touches `nomad/` -(including `nomad/jobs/`), `lib/init/nomad/`, `bin/disinto`, -`vault/policies/`, or `vault/roles.yaml`. Eight fail-closed steps: +(including `nomad/jobs/`), `lib/init/nomad/`, or `bin/disinto`. Five +fail-closed steps: 1. **`nomad config validate nomad/server.hcl nomad/client.hcl`** — parses the HCL, fails on unknown blocks, bad port ranges, invalid @@ -87,47 +87,19 @@ convention, KV path summary, and JWT-auth role bindings (S2.1/S2.3). disables the runtime checks (CI containers don't have `/var/lib/vault/data` or port 8200). Exit 2 (advisory warnings only, e.g. TLS-disabled listener) is tolerated; exit 1 blocks merge. -4. **`vault policy fmt` idempotence check on every `vault/policies/*.hcl`** - (S2.6) — `vault policy fmt` has no `-check` flag in 1.18.5, so the - step copies each file to `/tmp`, runs `vault policy fmt` on the copy, - and diffs against the original. Any non-empty diff means the - committed file would be rewritten by `fmt` and the step fails — the - author is pointed at `vault policy fmt ` to heal the drift. -5. **`vault policy write`-based validation against an inline dev-mode Vault** - (S2.6) — Vault 1.18.5 has no offline `policy validate` subcommand; - the CI step starts a dev-mode server, loops `vault policy write - ` over each `vault/policies/*.hcl`, and aggregates - failures so one CI run surfaces every broken policy. The server is - ephemeral and torn down on step exit — no persistence, no real - secrets. Catches unknown capability names (e.g. `"frobnicate"`), - malformed `path` blocks, and other semantic errors `fmt` does not. -6. **`vault/roles.yaml` validator** (S2.6) — yamllint + a PyYAML-based - check that every role's `policy:` field matches a basename under - `vault/policies/`, and that every role entry carries all four - required fields (`name`, `policy`, `namespace`, `job_id`). Drift - between the two directories is a scheduling-time "permission denied" - in production; this step turns it into a CI failure at PR time. -7. **`shellcheck --severity=warning lib/init/nomad/*.sh bin/disinto`** +4. **`shellcheck --severity=warning lib/init/nomad/*.sh bin/disinto`** — all init/dispatcher shell clean. `bin/disinto` has no `.sh` extension so the repo-wide shellcheck in `.woodpecker/ci.yml` skips it — this is the one place it gets checked. -8. **`bats tests/disinto-init-nomad.bats`** +5. **`bats tests/disinto-init-nomad.bats`** — exercises the dispatcher: `disinto init --backend=nomad --dry-run`, `… --empty --dry-run`, and the `--backend=docker` regression guard. -**Secret-scan coverage.** Policy HCL files under `vault/policies/` are -already swept by the P11 secret-scan gate -(`.woodpecker/secret-scan.yml`, #798), whose `vault/**/*` trigger path -covers everything in this directory. `nomad-validate.yml` intentionally -does NOT duplicate that gate — one scanner, one source of truth. - If a PR breaks `nomad/server.hcl` (e.g. typo in a block name), step 1 fails with a clear error; if it breaks a jobspec (e.g. misspells `task` as `tsak`, or adds a `volume` stanza without a `source`), step -2 fails; a typo in a `path "..."` block in a vault policy fails step 5 -with the Vault parser's error; a `roles.yaml` entry that points at a -policy basename that does not exist fails step 6. PRs that don't touch -any of the trigger paths skip this pipeline entirely. +2 fails instead. The fix makes it pass. PRs that don't touch any of +the trigger paths skip this pipeline entirely. ## Version pinning @@ -147,13 +119,5 @@ accept (or vice versa). - `lib/init/nomad/` — installer + systemd units + cluster-up orchestrator. - `.woodpecker/nomad-validate.yml` — this directory's CI pipeline. -- `vault/policies/` — Vault ACL policy HCL files (S2.1); the - `vault-policy-fmt` / `vault-policy-validate` CI steps above enforce - their shape. See [`../vault/policies/AGENTS.md`](../vault/policies/AGENTS.md) - for the policy lifecycle, CI enforcement details, and common failure - modes. -- `vault/roles.yaml` — JWT-auth role → policy bindings (S2.3); the - `vault-roles-validate` CI step above keeps it in lockstep with the - policies directory. - Top-of-file headers in `server.hcl` / `client.hcl` / `vault.hcl` document the per-file ownership contract. diff --git a/vault/policies/AGENTS.md b/vault/policies/AGENTS.md index a1b85c2..21d3e4e 100644 --- a/vault/policies/AGENTS.md +++ b/vault/policies/AGENTS.md @@ -49,17 +49,12 @@ validation. 1. Drop a file matching one of the four naming patterns above. Use an existing file in the same family as the template — comment header, capability list, and KV path layout should match the family. -2. Run `vault policy fmt ` locally so the formatting matches what - the CI fmt-check (step 4 of `.woodpecker/nomad-validate.yml`) will - accept. The fmt check runs non-destructively in CI but a dirty file - fails the step; running `fmt` locally before pushing is the fastest - path. -3. Add the matching entry to `../roles.yaml` (see "JWT-auth roles" below) - so the CI role-reference check (step 6) stays green. -4. Run `tools/vault-apply-policies.sh --dry-run` to confirm the new +2. Run `tools/vault-apply-policies.sh --dry-run` to confirm the new basename appears in the planned-work list with the expected SHA. -5. Run `tools/vault-apply-policies.sh` against a Vault instance to +3. Run `tools/vault-apply-policies.sh` against a Vault instance to create it; re-run to confirm it reports `unchanged`. +4. The CI fmt + validate step lands in S2.6 (#884). Until then + `vault policy fmt ` locally is the fastest sanity check. ## JWT-auth roles (S2.3) @@ -123,56 +118,6 @@ would let one service's tokens outlive the others — add a field to `vault/roles.yaml` and the applier at the same time if that ever becomes necessary. -## Policy lifecycle - -Adding a policy that an actual workload consumes is a three-step chain; -the CI pipeline guards each link. - -1. **Add the policy HCL** — `vault/policies/.hcl`, formatted with - `vault policy fmt`. Capabilities must be drawn from the Vault-recognized - set (`read`, `list`, `create`, `update`, `delete`, `patch`, `sudo`, - `deny`); a typo fails CI step 5 (HCL written to an inline dev-mode Vault - via `vault policy write` — a real parser, not a regex). -2. **Update `../roles.yaml`** — add a JWT-auth role entry whose `policy:` - field matches the new basename (without `.hcl`). CI step 6 re-checks - every role in this file against the policy set, so a drift between the - two directories fails the step. -3. **Reference from a Nomad jobspec** — add `vault { role = "" }` in - `nomad/jobs/.hcl` (owned by S2.4). Policies do not take effect - until a Nomad job asks for a token via that role. - -See the "Adding a new service" walkthrough below for the applier-script -flow once steps 1–3 are committed. - -## CI enforcement (`.woodpecker/nomad-validate.yml`) - -The pipeline triggers on any PR touching `vault/policies/**`, -`vault/roles.yaml`, or `lib/init/nomad/vault-*.sh` and runs four -vault-scoped checks (in addition to the nomad-scoped steps already in -place): - -| Step | Tool | What it catches | -|---|---|---| -| 4. `vault-policy-fmt` | `vault policy fmt` + `diff` | formatting drift — trailing whitespace, wrong indentation, missing newlines | -| 5. `vault-policy-validate` | `vault policy write` against inline dev Vault | HCL syntax errors, unknown stanzas, invalid capability names (e.g. `"frobnicate"`), malformed `path "..." {}` blocks | -| 6. `vault-roles-validate` | yamllint + PyYAML | roles.yaml syntax drift, missing required fields, role→policy references with no matching `.hcl` | -| P11 | `lib/secret-scan.sh` via `.woodpecker/secret-scan.yml` | literal secret leaked into a policy HCL (rare copy-paste mistake) — already covers `vault/**/*`, no duplicate step here | - -All four steps are fail-closed — any error blocks merge. The pipeline -pins `hashicorp/vault:1.18.5` (matching `lib/init/nomad/install.sh`); -bumping the runtime version without bumping the CI image is a CI-caught -drift. - -## Common failure modes - -| Symptom in CI logs | Root cause | Fix | -|---|---|---| -| `vault-policy-fmt: … is not formatted — run 'vault policy fmt '` | Trailing whitespace / mixed indent in an HCL file | `vault policy fmt ` locally and re-commit | -| `vault-policy-validate: … failed validation` plus a `policy` error from Vault | Unknown capability (e.g. `"frobnicate"`), unknown stanza, malformed `path` block | Fix the HCL; valid capabilities are `read`, `list`, `create`, `update`, `delete`, `patch`, `sudo`, `deny` | -| `vault-roles-validate: ERROR: role 'X' references policy 'Y' but vault/policies/Y.hcl does not exist` | A role's `policy:` field does not match any file basename in `vault/policies/` | Either add the missing policy HCL or fix the typo in `roles.yaml` | -| `vault-roles-validate: ERROR: role entry missing required field 'Z'` | A role in `roles.yaml` is missing one of `name`, `policy`, `namespace`, `job_id` | Add the field; all four are required | -| P11 `secret-scan: detected potential secret …` on a `.hcl` file | A literal token/password was pasted into a policy | Policies must name KV paths, not carry secret values — move the literal into KV (S2.2) and have the policy grant `read` on the path | - ## What this directory does NOT own - **Attaching policies to Nomad jobs.** That's S2.4 (#882) via the @@ -180,3 +125,4 @@ drift. name in `vault { role = "..." }` is what binds the policy. - **Writing the secret values themselves.** That's S2.2 (#880) via `tools/vault-import.sh`. +- **CI policy fmt + validate + roles.yaml check.** That's S2.6 (#884).