fix: [nomad-step-2] S2.3 — vault-nomad-auth.sh (enable JWT auth + roles + nomad workload identity) (#881)
All checks were successful
ci/woodpecker/push/ci Pipeline was successful
ci/woodpecker/push/nomad-validate Pipeline was successful
ci/woodpecker/pr/ci Pipeline was successful
ci/woodpecker/pr/nomad-validate Pipeline was successful
ci/woodpecker/pr/secret-scan Pipeline was successful

Wires Nomad → Vault via workload identity so jobs can exchange their
short-lived JWT for a Vault token carrying the policies in
vault/policies/ — no shared VAULT_TOKEN in job env.

- `lib/init/nomad/vault-nomad-auth.sh` — idempotent script: enable jwt
  auth at path `jwt-nomad`, config JWKS/algs, apply roles, install
  server.hcl + SIGHUP nomad on change.
- `tools/vault-apply-roles.sh` — companion sync script (S2.1 sibling);
  reads vault/roles.yaml and upserts each Vault role under
  auth/jwt-nomad/role/<name> with created/updated/unchanged semantics.
- `vault/roles.yaml` — declarative role→policy→bound_claims map; one
  entry per vault/policies/*.hcl. Keeps S2.1 policies and S2.3 role
  bindings visible side-by-side at review time.
- `nomad/server.hcl` — adds vault stanza (enabled, address,
  default_identity.aud=["vault.io"], ttl=1h).
- `lib/hvault.sh` — new `hvault_get_or_empty` helper shared between
  vault-apply-policies.sh, vault-apply-roles.sh, and vault-nomad-auth.sh;
  reads a Vault endpoint and distinguishes 200 / 404 / other.
- `vault/policies/AGENTS.md` — extends S2.1 docs with JWT-auth role
  naming convention, token shape, and the "add new service" flow.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Claude 2026-04-16 16:44:22 +00:00
parent 88e49b9e9d
commit 8efef9f1bb
7 changed files with 776 additions and 35 deletions

View file

@ -55,12 +55,73 @@ validation.
4. The CI fmt + validate step lands in S2.6 (#884). Until then
`vault policy fmt <file>` locally is the fastest sanity check.
## JWT-auth roles (S2.3)
Policies are inert until a Vault token carrying them is minted. In this
migration that mint path is JWT auth — Nomad jobs exchange their
workload-identity JWT for a Vault token via
`auth/jwt-nomad/role/<name>``token_policies = ["<policy>"]`. The
role bindings live in [`../roles.yaml`](../roles.yaml); the script that
enables the auth method + writes the config + applies roles is
[`lib/init/nomad/vault-nomad-auth.sh`](../../lib/init/nomad/vault-nomad-auth.sh).
The applier is [`tools/vault-apply-roles.sh`](../../tools/vault-apply-roles.sh).
### Role → policy naming convention
Role name == policy name, 1:1. `vault/roles.yaml` carries one entry per
`vault/policies/*.hcl` file:
```yaml
roles:
- name: service-forgejo # Vault role
policy: service-forgejo # ACL policy attached to minted tokens
namespace: default # bound_claims.nomad_namespace
job_id: forgejo # bound_claims.nomad_job_id
```
The role name is what jobspecs reference via `vault { role = "..." }`
keep it identical to the policy basename so an S2.1↔S2.3 drift (new
policy without a role, or vice versa) shows up in one directory review,
not as a runtime "permission denied" at job placement.
`bound_claims.nomad_job_id` is the actual `job "..."` name in the
jobspec, which may differ from the policy name (e.g. policy
`service-forgejo` binds to job `forgejo`). Update it when each bot's or
runner's jobspec lands.
### Adding a new service
1. Write `vault/policies/<name>.hcl` using the naming-table family that
fits (`service-`, `bot-`, `runner-`, or standalone).
2. Add a matching entry to `vault/roles.yaml` with all four fields
(`name`, `policy`, `namespace`, `job_id`).
3. Apply both — either in one shot via `lib/init/nomad/vault-nomad-auth.sh`
(policies → roles → nomad SIGHUP), or granularly via
`tools/vault-apply-policies.sh` + `tools/vault-apply-roles.sh`.
4. Reference the role in the consuming jobspec's `vault { role = "<name>" }`.
### Token shape
All roles share the same token shape, hardcoded in
`tools/vault-apply-roles.sh`:
| Field | Value |
|---|---|
| `bound_audiences` | `["vault.io"]` — matches `default_identity.aud` in `nomad/server.hcl` |
| `token_type` | `service` — auto-revoked when the task exits |
| `token_ttl` | `1h` |
| `token_max_ttl` | `24h` |
Bumping any of these is a knowing, repo-wide change. Per-role overrides
would let one service's tokens outlive the others — add a field to
`vault/roles.yaml` and the applier at the same time if that ever
becomes necessary.
## What this directory does NOT own
- **Attaching policies to Nomad jobs.** That's S2.4 (#882) via the
jobspec `template { vault { policies = […] } }` stanza.
- **Enabling JWT auth + Nomad workload identity roles.** That's S2.3
(#881).
jobspec `template { vault { policies = […] } }` stanza — the role
name in `vault { role = "..." }` is what binds the policy.
- **Writing the secret values themselves.** That's S2.2 (#880) via
`tools/vault-import.sh`.
- **CI policy fmt + validate + roles.yaml check.** That's S2.6 (#884).

150
vault/roles.yaml Normal file
View file

@ -0,0 +1,150 @@
# =============================================================================
# vault/roles.yaml — Vault JWT-auth role bindings for Nomad workload identity
#
# Part of the Nomad+Vault migration (S2.3, issue #881). One entry per
# vault/policies/*.hcl policy. Each entry pairs:
#
# - the Vault role name (what a Nomad job references via
# `vault { role = "..." }` in its jobspec), with
# - the ACL policy attached to tokens it mints, and
# - the bound claims that gate which Nomad workloads may authenticate
# through that role (prevents a jobspec named "woodpecker" from
# asking for role "service-forgejo").
#
# The source of truth for *what* secrets each role's token can read is
# vault/policies/<policy>.hcl. This file only wires role→policy→claims.
# Keeping the two side-by-side in the repo means an S2.1↔S2.3 drift
# (new policy without a role, or vice versa) shows up in one directory
# review, not as a runtime "permission denied" at job placement.
#
# All roles share the same constants (hardcoded in tools/vault-apply-roles.sh):
# - bound_audiences = ["vault.io"] — Nomad's default workload-identity aud
# - token_type = "service" — revoked when task exits
# - token_ttl = "1h" — token lifetime
# - token_max_ttl = "24h" — hard cap across renewals
#
# Format (strict — parsed line-by-line by tools/vault-apply-roles.sh with
# awk; keep the "- name:" prefix + two-space nested indent exactly as
# shown below):
#
# roles:
# - name: <vault-role-name> # path: auth/jwt-nomad/role/<name>
# policy: <acl-policy-name> # must match vault/policies/<name>.hcl
# namespace: <nomad-namespace> # bound_claims.nomad_namespace
# job_id: <nomad-job-id> # bound_claims.nomad_job_id
#
# All four fields are required. Comments (#) and blank lines are ignored.
#
# Adding a new role:
# 1. Land the companion vault/policies/<name>.hcl in S2.1 style.
# 2. Add a block here with all four fields.
# 3. Run tools/vault-apply-roles.sh to upsert it.
# 4. Re-run to confirm "role <name> unchanged".
# =============================================================================
roles:
# ── Long-running services (nomad/jobs/<name>.hcl) ──────────────────────────
# The jobspec's nomad job name is the bound job_id, e.g. `job "forgejo"`
# in nomad/jobs/forgejo.hcl → job_id: forgejo. The policy name stays
# `service-<name>` so the directory layout under vault/policies/ groups
# platform services under a single prefix.
- name: service-forgejo
policy: service-forgejo
namespace: default
job_id: forgejo
- name: service-woodpecker
policy: service-woodpecker
namespace: default
job_id: woodpecker
# ── Per-agent bots (nomad/jobs/bot-<role>.hcl — land in later steps) ───────
# job_id placeholders match the policy name 1:1 until each bot's jobspec
# lands. When a bot's jobspec is added under nomad/jobs/, update the
# corresponding job_id here to match the jobspec's `job "<name>"` — and
# CI's S2.6 roles.yaml check will confirm the pairing.
- name: bot-dev
policy: bot-dev
namespace: default
job_id: bot-dev
- name: bot-dev-qwen
policy: bot-dev-qwen
namespace: default
job_id: bot-dev-qwen
- name: bot-review
policy: bot-review
namespace: default
job_id: bot-review
- name: bot-gardener
policy: bot-gardener
namespace: default
job_id: bot-gardener
- name: bot-planner
policy: bot-planner
namespace: default
job_id: bot-planner
- name: bot-predictor
policy: bot-predictor
namespace: default
job_id: bot-predictor
- name: bot-supervisor
policy: bot-supervisor
namespace: default
job_id: bot-supervisor
- name: bot-architect
policy: bot-architect
namespace: default
job_id: bot-architect
- name: bot-vault
policy: bot-vault
namespace: default
job_id: bot-vault
# ── Edge dispatcher ────────────────────────────────────────────────────────
- name: dispatcher
policy: dispatcher
namespace: default
job_id: dispatcher
# ── Per-secret runner roles ────────────────────────────────────────────────
# vault-runner (Step 5) composes runner-<NAME> policies onto each
# ephemeral dispatch token based on the action TOML's `secrets = [...]`.
# The per-dispatch runner jobspec job_id follows the same `runner-<NAME>`
# convention (one jobspec per secret, minted per dispatch) so the bound
# claim matches the role name directly.
- name: runner-GITHUB_TOKEN
policy: runner-GITHUB_TOKEN
namespace: default
job_id: runner-GITHUB_TOKEN
- name: runner-CODEBERG_TOKEN
policy: runner-CODEBERG_TOKEN
namespace: default
job_id: runner-CODEBERG_TOKEN
- name: runner-CLAWHUB_TOKEN
policy: runner-CLAWHUB_TOKEN
namespace: default
job_id: runner-CLAWHUB_TOKEN
- name: runner-DEPLOY_KEY
policy: runner-DEPLOY_KEY
namespace: default
job_id: runner-DEPLOY_KEY
- name: runner-NPM_TOKEN
policy: runner-NPM_TOKEN
namespace: default
job_id: runner-NPM_TOKEN
- name: runner-DOCKER_HUB_TOKEN
policy: runner-DOCKER_HUB_TOKEN
namespace: default
job_id: runner-DOCKER_HUB_TOKEN