diff --git a/sprints/nomad-dispatcher-cutover.md b/sprints/nomad-dispatcher-cutover.md
new file mode 100644
index 0000000..52b49ed
--- /dev/null
+++ b/sprints/nomad-dispatcher-cutover.md
@@ -0,0 +1,132 @@
+# Sprint: nomad-dispatcher-cutover
+
+## Vision issues
+- #981 — vision: [nomad-step-5] S5 — implement dispatcher Nomad backend + retire docker-compose dispatch
+
+## What this enables
+The edge dispatcher can launch vault-runner and sidecar (reproduce/triage/verify) jobs via Nomad instead of `docker run`. This completes the Nomad migration: every workload runs under Nomad's scheduler with Vault-managed secrets, enabling proper resource limits, restart policies, and audit trails. After cutover, the docker-compose dispatch path is retired — the dispatcher no longer needs the Docker socket.
+
+## What exists today
+- **`_launch_runner_nomad()`** (dispatcher.sh:561-725) — substantially implemented: dispatches via `nomad job dispatch -detach`, polls allocation state, extracts exit code and logs. Needs validation, not greenfield.
+- **`vault-runner.hcl`** — parameterized batch jobspec with pre-templated secrets (6 runner secrets via `error_on_missing_key=false` fallback). Ready for dispatch.
+- **`edge.hcl`** — dispatcher service task with `DISPATCHER_BACKEND=nomad` and `service-dispatcher` Vault role. Deployed.
+- **Per-secret Vault policies and roles** — `runner-GITHUB_TOKEN`, `runner-CODEBERG_TOKEN`, etc. exist in `vault/policies/` and `vault/roles.yaml`.
+- **Docker backend** — `_launch_runner_docker()` and `_dispatch_sidecar_docker()` fully working as the production dispatch path.
+- **`_dispatch_sidecar_nomad()`** (dispatcher.sh:842-848) — pure stub, returns 1.
+- **No sidecar jobspec** — no Nomad equivalent of the reproduce/triage/verify containers.
+- **Mounts (ssh/gpg/sops)** — handled by Docker backend via bind mounts; `mounts_csv` is passed but ignored in Nomad path.
+
+## Complexity
+- **Primary files:** dispatcher.sh (~1300 lines, 2 functions to implement/fix), 1 new jobspec (sidecar batch), vault-runner.hcl (mount additions), vault policy composition logic
+- **Subsystems touched:** dispatcher, Nomad jobspecs, Vault policies, `bin/disinto` wiring, deploy.sh
+- **Estimated sub-issues:** 6
+- **Ratio:** ~80% gluecode (wiring existing Nomad/Vault primitives), ~20% new logic (sidecar jobspec, policy composition, cutover gate)
+
+## Risks
+- **Silent secret drop:** Nomad template renders missing secrets as empty strings — a dispatched runner could silently operate without credentials it needs. The docker path fails loudly via `load_secret()`.
+- **Sidecar lifecycle mismatch:** Docker sidecars run as background processes tracked by PID. Nomad batch jobs have different lifecycle semantics (no PID, allocation-based tracking). The polling loops in reproduce/triage candidate selection must adapt.
+- **Policy composition race:** If the dispatcher's Vault token lacks permission to attach per-dispatch policies, every nomad dispatch fails. This is a new capability the `service-dispatcher` role doesn't currently grant.
+- **Mount mapping:** Docker bind-mounts (docker.sock, ssh keys, gpg) don't map 1:1 to Nomad host volumes. Missing mounts = broken formulas that need credentials.
+- **Rollback gap:** If nomad dispatch breaks in production, there's no automatic fallback to docker unless explicitly coded.
+
+## Cost — new infra to maintain
+- **1 new jobspec:** sidecar batch job (reproduce/triage/verify) — parameterized, analogous to vault-runner.hcl
+- **Policy composition logic** in dispatcher — new code path that must stay in sync with `vault/policies/runner-*.hcl`
+- **No new services or scheduled tasks** — uses existing Nomad cluster, Vault instance, and dispatcher polling loop
+- **Host volume declarations** for ssh/gpg/sops on Nomad clients (if formulas require them)
+
+## Recommendation
+Worth it. This is the final step in the Nomad migration (S1-S4 complete). Most infrastructure exists — vault-runner.hcl, edge.hcl, per-secret policies, and the nomad launcher function are already landed. The work is predominantly wiring and validation. Deferring leaves the dispatcher dependent on the Docker socket, which contradicts the Nomad migration's security and scheduling goals. The sidecar jobspec is the only greenfield piece; everything else is completing stubs or adding integration tests.
+
+## Sub-issues
+
+
+- id: s5-nomad-policy-composition
+  title: "vision(#981): dispatcher composes per-dispatch Vault policies for nomad runner"
+  labels: [backlog]
+  depends_on: []
+  body: |
+    ## Goal
+    The dispatcher dynamically attaches only the Vault policies required by each action's `secrets = [...]` list when dispatching a `vault-runner` batch job, so Nomad-dispatched runners receive scoped secret access.
+
+    ## Acceptance criteria
+    - [ ] Dispatcher reads `secrets` field from action TOML and maps each to its `runner-` Vault policy
+    - [ ] `service-dispatcher` Vault policy updated to allow token creation with runner-* policies
+    - [ ] Dispatched vault-runner job receives a Vault token scoped to only the requested secrets
+    - [ ] Dispatch with empty `secrets = []` succeeds (runner gets no secret policies)
+    - [ ] Dispatch with unknown secret name logs error and fails before launch
+
+- id: s5-nomad-sidecar-jobspec
+  title: "vision(#981): add parameterized batch jobspec for reproduce/triage/verify sidecars"
+  labels: [backlog]
+  depends_on: []
+  body: |
+    ## Goal
+    Create a Nomad parameterized batch jobspec for sidecar containers (reproduce, triage, verify) analogous to vault-runner.hcl, so `_dispatch_sidecar_nomad()` has a job to dispatch.
+
+    ## Acceptance criteria
+    - [ ] `nomad/jobs/sidecar.hcl` exists as a parameterized batch job accepting `issue_number`, `formula`, and `project_toml` meta
+    - [ ] Jobspec mounts required host volumes (project-repos, ops-repo, agent-data)
+    - [ ] Vault template renders bot token for sidecar identity (e.g. `kv/data/disinto/bots/dev`)
+    - [ ] `nomad job validate nomad/jobs/sidecar.hcl` passes
+    - [ ] `bin/disinto --with edge` deploys the sidecar job alongside other jobs
+
+- id: s5-implement-dispatch-sidecar-nomad
+  title: "vision(#981): implement _dispatch_sidecar_nomad() in dispatcher.sh"
+  labels: [backlog]
+  depends_on: [s5-nomad-sidecar-jobspec]
+  body: |
+    ## Goal
+    Replace the stub `_dispatch_sidecar_nomad()` with a working implementation that dispatches reproduce/triage/verify jobs via Nomad and tracks their allocation lifecycle.
+
+    ## Acceptance criteria
+    - [ ] `_dispatch_sidecar_nomad()` dispatches `sidecar` parameterized batch job with correct meta
+    - [ ] Dispatcher tracks sidecar allocations instead of PIDs for the Nomad backend
+    - [ ] Reproduce/triage/verify polling loops correctly detect sidecar completion via Nomad allocation state
+    - [ ] Sidecar logs are retrievable via `nomad alloc logs` and included in dispatcher logging
+
+- id: s5-nomad-runner-mounts
+  title: "vision(#981): support ssh/gpg/sops mount aliases in Nomad dispatch path"
+  labels: [backlog]
+  depends_on: []
+  body: |
+    ## Goal
+    Action TOMLs with `mounts = ["ssh", "gpg", "sops"]` work correctly when dispatched via the Nomad backend, matching the Docker backend's bind-mount behavior.
+
+    ## Acceptance criteria
+    - [ ] Nomad client config declares host volumes for ssh, gpg, and sops credential paths
+    - [ ] `vault-runner.hcl` declares optional volume mounts for each alias
+    - [ ] `_launch_runner_nomad()` passes mount requirements as dispatch meta
+    - [ ] Runner container receives mounted credentials at expected paths
+    - [ ] Dispatch without mounts succeeds (no volume mount errors)
+
+- id: s5-nomad-dispatch-smoke-test
+  title: "vision(#981): end-to-end smoke test for Nomad dispatch path"
+  labels: [backlog]
+  depends_on: [s5-nomad-policy-composition, s5-implement-dispatch-sidecar-nomad, s5-nomad-runner-mounts]
+  body: |
+    ## Goal
+    Validate the full Nomad dispatch path works end-to-end: action TOML to dispatcher to nomad job dispatch to runner execution to result.json written back to ops repo.
+
+    ## Acceptance criteria
+    - [ ] Test action TOML dispatched with `DISPATCHER_BACKEND=nomad` completes successfully
+    - [ ] Runner receives correct secrets via Vault template (non-empty for granted, empty for ungranted)
+    - [ ] Result JSON written to ops repo with correct exit code and log excerpt
+    - [ ] Sidecar dispatch (reproduce formula) launches and completes via Nomad
+    - [ ] Failure case: action with invalid formula produces exit code != 0 in result
+
+- id: s5-retire-docker-compose-dispatch
+  title: "vision(#981): retire docker-compose dispatch path and remove Docker socket dependency"
+  labels: [backlog]
+  depends_on: [s5-nomad-dispatch-smoke-test]
+  body: |
+    ## Goal
+    Remove the docker-compose dispatch backend from the dispatcher after Nomad path is validated, eliminating the Docker socket mount from the edge container.
+
+    ## Acceptance criteria
+    - [ ] `_launch_runner_docker()` and `_dispatch_sidecar_docker()` removed from dispatcher.sh
+    - [ ] `DISPATCHER_BACKEND` env var removed; Nomad is the only path
+    - [ ] `edge.hcl` no longer mounts Docker socket
+    - [ ] `docker-compose.yml` edge service no longer mounts Docker socket (if compose is retained for dev)
+    - [ ] Dispatcher startup validates Nomad connectivity (nomad status) before entering poll loop
+