fix: [nomad-step-2] S2.3 — vault-nomad-auth.sh (enable JWT auth + roles + nomad workload identity) (#881)

Wires Nomad → Vault via workload identity so jobs can exchange their short-lived JWT for a Vault token carrying the policies in vault/policies/, with no shared VAULT_TOKEN in job env.

- `lib/init/nomad/vault-nomad-auth.sh`: idempotent script that enables jwt auth at path `jwt-nomad`, configures JWKS/algs, applies roles, installs server.hcl, and SIGHUPs nomad on change.
- `tools/vault-apply-roles.sh`: companion sync script (S2.1 sibling); reads vault/roles.yaml and upserts each Vault role under auth/jwt-nomad/role/<name> with created/updated/unchanged semantics.
- `vault/roles.yaml`: declarative role→policy→bound_claims map; one entry per vault/policies/*.hcl. Keeps S2.1 policies and S2.3 role bindings visible side-by-side at review time.
- `nomad/server.hcl`: adds vault stanza (enabled, address, default_identity.aud=["vault.io"], ttl=1h).
- `lib/hvault.sh`: new `hvault_get_or_empty` helper shared between vault-apply-policies.sh, vault-apply-roles.sh, and vault-nomad-auth.sh; reads a Vault endpoint and distinguishes 200 / 404 / other.
- `vault/policies/AGENTS.md`: extends S2.1 docs with JWT-auth role naming convention, token shape, and the "add new service" flow.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 16:44:22 +00:00
commit 8efef9f1bb
parent 88e49b9e9d
7 changed files with 776 additions and 35 deletions
--- a/vault/policies/AGENTS.md
+++ b/vault/policies/AGENTS.md
@@ -55,12 +55,73 @@ validation.
 4. The CI fmt + validate step lands in S2.6 (#884). Until then
   `vault policy fmt <file>` locally is the fastest sanity check.

+## JWT-auth roles (S2.3)
+
+Policies are inert until a Vault token carrying them is minted. In this
+migration that mint path is JWT auth — Nomad jobs exchange their
+workload-identity JWT for a Vault token via
+`auth/jwt-nomad/role/<name>` → `token_policies = ["<policy>"]`. The
+role bindings live in [`../roles.yaml`](../roles.yaml); the script that
+enables the auth method + writes the config + applies roles is
+[`lib/init/nomad/vault-nomad-auth.sh`](../../lib/init/nomad/vault-nomad-auth.sh).
+The applier is [`tools/vault-apply-roles.sh`](../../tools/vault-apply-roles.sh).
+
+### Role → policy naming convention
+
+Role name == policy name, 1:1. `vault/roles.yaml` carries one entry per
+`vault/policies/*.hcl` file:
+
+```yaml
+roles:
+  - name:      service-forgejo      # Vault role
+    policy:    service-forgejo      # ACL policy attached to minted tokens
+    namespace: default              # bound_claims.nomad_namespace
+    job_id:    forgejo              # bound_claims.nomad_job_id
+```
+
+The role name is what jobspecs reference via `vault { role = "..." }` —
+keep it identical to the policy basename so an S2.1↔S2.3 drift (new
+policy without a role, or vice versa) shows up in one directory review,
+not as a runtime "permission denied" at job placement.
+
+`bound_claims.nomad_job_id` is the actual `job "..."` name in the
+jobspec, which may differ from the policy name (e.g. policy
+`service-forgejo` binds to job `forgejo`). Update it when each bot's or
+runner's jobspec lands.
+
+### Adding a new service
+
+1. Write `vault/policies/<name>.hcl` using the naming-table family that
+   fits (`service-`, `bot-`, `runner-`, or standalone).
+2. Add a matching entry to `vault/roles.yaml` with all four fields
+   (`name`, `policy`, `namespace`, `job_id`).
+3. Apply both — either in one shot via `lib/init/nomad/vault-nomad-auth.sh`
+   (policies → roles → nomad SIGHUP), or granularly via
+   `tools/vault-apply-policies.sh` + `tools/vault-apply-roles.sh`.
+4. Reference the role in the consuming jobspec's `vault { role = "<name>" }`.
+
+### Token shape
+
+All roles share the same token shape, hardcoded in
+`tools/vault-apply-roles.sh`:
+
+| Field | Value |
+|---|---|
+| `bound_audiences` | `["vault.io"]` — matches `default_identity.aud` in `nomad/server.hcl` |
+| `token_type` | `service` — auto-revoked when the task exits |
+| `token_ttl` | `1h` |
+| `token_max_ttl` | `24h` |
+
+Bumping any of these is a knowing, repo-wide change. Per-role overrides
+would let one service's tokens outlive the others — add a field to
+`vault/roles.yaml` and the applier at the same time if that ever
+becomes necessary.
+
 ## What this directory does NOT own

 - **Attaching policies to Nomad jobs.** That's S2.4 (#882) via the
-  jobspec `template { vault { policies = […] } }` stanza.
-- **Enabling JWT auth + Nomad workload identity roles.** That's S2.3
-  (#881).
+  jobspec `template { vault { policies = […] } }` stanza — the role
+  name in `vault { role = "..." }` is what binds the policy.
 - **Writing the secret values themselves.** That's S2.2 (#880) via
  `tools/vault-import.sh`.
 - **CI policy fmt + validate + roles.yaml check.** That's S2.6 (#884).
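
Reviewer sketch (not part of the commit): the "Token shape" table and the `bound_claims` fields in the AGENTS.md hunk above together describe the body written to `auth/jwt-nomad/role/<name>`. The snippet below shows one way that body could be assembled for a single `roles.yaml` entry. `role_payload` is a hypothetical helper, and the `user_claim` value is an assumption: Vault JWT roles require a `user_claim`, but the diff does not show which claim the applier actually sets.

```shell
#!/bin/sh
# Sketch only. role_payload is a hypothetical helper, not code from
# tools/vault-apply-roles.sh. It prints the JSON body one might write to
# auth/jwt-nomad/role/<name> for a single roles.yaml entry.
#
# Usage: role_payload <policy> <namespace> <job_id>
role_payload() {
  # The four token-shape constants come from the "Token shape" table.
  # ASSUMPTION: user_claim=nomad_job_id. Vault JWT roles require a
  # user_claim, but the diff does not say which claim the applier uses.
  printf '{"role_type":"jwt","user_claim":"nomad_job_id",'
  printf '"bound_audiences":["vault.io"],'
  printf '"bound_claims":{"nomad_namespace":"%s","nomad_job_id":"%s"},' "$2" "$3"
  printf '"token_policies":["%s"],' "$1"
  printf '"token_type":"service","token_ttl":"1h","token_max_ttl":"24h"}\n'
}

# Example: the service-forgejo entry from vault/roles.yaml.
role_payload service-forgejo default forgejo
```

Keeping the constants in one place mirrors the stated design choice that every role shares the same token shape; only the policy, namespace, and job ID vary per entry.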
--- a/vault/roles.yaml
+++ b/vault/roles.yaml
@@ -0,0 +1,150 @@
+# =============================================================================
+# vault/roles.yaml — Vault JWT-auth role bindings for Nomad workload identity
+#
+# Part of the Nomad+Vault migration (S2.3, issue #881). One entry per
+# vault/policies/*.hcl policy. Each entry pairs:
+#
+#   - the Vault role name (what a Nomad job references via
+#     `vault { role = "..." }` in its jobspec), with
+#   - the ACL policy attached to tokens it mints, and
+#   - the bound claims that gate which Nomad workloads may authenticate
+#     through that role (prevents a jobspec named "woodpecker" from
+#     asking for role "service-forgejo").
+#
+# The source of truth for *what* secrets each role's token can read is
+# vault/policies/<policy>.hcl. This file only wires role→policy→claims.
+# Keeping the two side-by-side in the repo means an S2.1↔S2.3 drift
+# (new policy without a role, or vice versa) shows up in one directory
+# review, not as a runtime "permission denied" at job placement.
+#
+# All roles share the same constants (hardcoded in tools/vault-apply-roles.sh):
+#   - bound_audiences = ["vault.io"]      — Nomad's default workload-identity aud
+#   - token_type      = "service"         — revoked when task exits
+#   - token_ttl       = "1h"              — token lifetime
+#   - token_max_ttl   = "24h"             — hard cap across renewals
+#
+# Format (strict — parsed line-by-line by tools/vault-apply-roles.sh with
+# awk; keep the "- name:" prefix + two-space nested indent exactly as
+# shown below):
+#
+#   roles:
+#     - name:      <vault-role-name>    # path: auth/jwt-nomad/role/<name>
+#       policy:    <acl-policy-name>    # must match vault/policies/<name>.hcl
+#       namespace: <nomad-namespace>    # bound_claims.nomad_namespace
+#       job_id:    <nomad-job-id>       # bound_claims.nomad_job_id
+#
+# All four fields are required. Comments (#) and blank lines are ignored.
+#
+# Adding a new role:
+#   1. Land the companion vault/policies/<name>.hcl in S2.1 style.
+#   2. Add a block here with all four fields.
+#   3. Run tools/vault-apply-roles.sh to upsert it.
+#   4. Re-run to confirm "role <name> unchanged".
+# =============================================================================
+roles:
+  # ── Long-running services (nomad/jobs/<name>.hcl) ──────────────────────────
+  # The jobspec's nomad job name is the bound job_id, e.g. `job "forgejo"`
+  # in nomad/jobs/forgejo.hcl → job_id: forgejo. The policy name stays
+  # `service-<name>` so the directory layout under vault/policies/ groups
+  # platform services under a single prefix.
+  - name:      service-forgejo
+    policy:    service-forgejo
+    namespace: default
+    job_id:    forgejo
+
+  - name:      service-woodpecker
+    policy:    service-woodpecker
+    namespace: default
+    job_id:    woodpecker
+
+  # ── Per-agent bots (nomad/jobs/bot-<role>.hcl — land in later steps) ───────
+  # job_id placeholders match the policy name 1:1 until each bot's jobspec
+  # lands. When a bot's jobspec is added under nomad/jobs/, update the
+  # corresponding job_id here to match the jobspec's `job "<name>"` — and
+  # CI's S2.6 roles.yaml check will confirm the pairing.
+  - name:      bot-dev
+    policy:    bot-dev
+    namespace: default
+    job_id:    bot-dev
+
+  - name:      bot-dev-qwen
+    policy:    bot-dev-qwen
+    namespace: default
+    job_id:    bot-dev-qwen
+
+  - name:      bot-review
+    policy:    bot-review
+    namespace: default
+    job_id:    bot-review
+
+  - name:      bot-gardener
+    policy:    bot-gardener
+    namespace: default
+    job_id:    bot-gardener
+
+  - name:      bot-planner
+    policy:    bot-planner
+    namespace: default
+    job_id:    bot-planner
+
+  - name:      bot-predictor
+    policy:    bot-predictor
+    namespace: default
+    job_id:    bot-predictor
+
+  - name:      bot-supervisor
+    policy:    bot-supervisor
+    namespace: default
+    job_id:    bot-supervisor
+
+  - name:      bot-architect
+    policy:    bot-architect
+    namespace: default
+    job_id:    bot-architect
+
+  - name:      bot-vault
+    policy:    bot-vault
+    namespace: default
+    job_id:    bot-vault
+
+  # ── Edge dispatcher ────────────────────────────────────────────────────────
+  - name:      dispatcher
+    policy:    dispatcher
+    namespace: default
+    job_id:    dispatcher
+
+  # ── Per-secret runner roles ────────────────────────────────────────────────
+  # vault-runner (Step 5) composes runner-<NAME> policies onto each
+  # ephemeral dispatch token based on the action TOML's `secrets = [...]`.
+  # The per-dispatch runner jobspec job_id follows the same `runner-<NAME>`
+  # convention (one jobspec per secret, minted per dispatch) so the bound
+  # claim matches the role name directly.
+  - name:      runner-GITHUB_TOKEN
+    policy:    runner-GITHUB_TOKEN
+    namespace: default
+    job_id:    runner-GITHUB_TOKEN
+
+  - name:      runner-CODEBERG_TOKEN
+    policy:    runner-CODEBERG_TOKEN
+    namespace: default
+    job_id:    runner-CODEBERG_TOKEN
+
+  - name:      runner-CLAWHUB_TOKEN
+    policy:    runner-CLAWHUB_TOKEN
+    namespace: default
+    job_id:    runner-CLAWHUB_TOKEN
+
+  - name:      runner-DEPLOY_KEY
+    policy:    runner-DEPLOY_KEY
+    namespace: default
+    job_id:    runner-DEPLOY_KEY
+
+  - name:      runner-NPM_TOKEN
+    policy:    runner-NPM_TOKEN
+    namespace: default
+    job_id:    runner-NPM_TOKEN
+
+  - name:      runner-DOCKER_HUB_TOKEN
+    policy:    runner-DOCKER_HUB_TOKEN
+    namespace: default
+    job_id:    runner-DOCKER_HUB_TOKEN
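
Reviewer sketch (not part of the commit): the header comment in `vault/roles.yaml` says the file is parsed line-by-line with awk and that all four fields are required, and AGENTS.md states that role name and policy name match 1:1. The snippet below is a minimal illustration of how such a parse and check could look. `parse_roles` and `check_roles` are hypothetical names, and this is not the logic of `tools/vault-apply-roles.sh`, which is not shown in this diff.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers, not the code in tools/vault-apply-roles.sh.

# parse_roles FILE
# Prints one line per role: name<TAB>policy<TAB>namespace<TAB>job_id.
parse_roles() {
  awk '
    function emit() {
      if (name != "") printf "%s\t%s\t%s\t%s\n", name, policy, ns, job
      name = ""; policy = ""; ns = ""; job = ""
    }
    { sub(/[ \t]+#.*$/, "") }          # drop indented and trailing comments
    /^[ \t]*#/ { next }                # skip full-line comments
    /^[ \t]*$/ { next }                # skip blank lines
    /^[ \t]*-[ \t]+name:/ { emit(); name = $3; next }   # new entry starts
    /^[ \t]+policy:/      { policy = $2; next }
    /^[ \t]+namespace:/   { ns = $2; next }
    /^[ \t]+job_id:/      { job = $2; next }
    END { emit() }
  ' "$1"
}

# check_roles FILE
# Exits non-zero if an entry lacks a field or if name and policy differ.
check_roles() {
  parse_roles "$1" | awk -F '\t' '
    $1 == "" || $2 == "" || $3 == "" || $4 == "" {
      print "missing field in role: " $1; bad = 1
    }
    $1 != $2 { print "name/policy mismatch: " $1; bad = 1 }
    END { exit bad + 0 }
  '
}

# Run against the real file when it is present in the working tree.
if [ -f vault/roles.yaml ]; then
  check_roles vault/roles.yaml
fi
```

A check along these lines would flag the drift the file's header warns about (a role whose policy name diverges) at review time rather than at job placement. Confirming that each policy has a matching `vault/policies/<name>.hcl` would need a separate directory comparison.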