fix: [nomad-step-5] S5.4 — dispatcher.sh DISPATCHER_BACKEND=nomad branch (nomad job dispatch) (#991) #997
Reference: disinto-admin/disinto#997
Fixes #991
Changes
AI Review
Correctness Claims
Before reading: (1) dispatch passes all three meta keys and vault-runner.hcl accepts them; (2) polling loop tracks the dispatched child job until terminal state; (3) exit-code extraction reads the right JSON field.
Blocker 1: undeclared mounts_csv meta key will cause dispatch rejection
What's wrong. vault-runner.hcl declares no meta_optional list. Nomad rejects dispatch with "Unexpected parameterized job metadata keys" when any key not in meta_required or meta_optional is passed. The dispatcher sends -meta mounts_csv=... on every call, so every dispatch will fail at the API level before a job is placed.
The comment at line 578 even acknowledges the inconsistency: "mounts_csv is passed as env var for the nomad task to consume". Yet the code passes it as -meta, not as an env var, and the task config has no reference to NOMAD_META_mounts_csv anyway.
Fix. Remove the -meta mounts_csv=${mounts_csv:-} flag from the dispatch call (mounts_csv handling in the Nomad backend is out of scope for this step, matching how the HCL omits it). If mounts_csv support is required, add meta_optional = ["mounts_csv"] to vault-runner.hcl and wire NOMAD_META_mounts_csv into the task in the same PR.
Confirm done. Dispatch with only action_id and secrets_csv meta succeeds without an "Unexpected" error from Nomad.
Blocker 2: polling loop targets the parent job, not the dispatched child
What's wrong. nomad job dispatch -detach prints both a Dispatched Job ID line and an Evaluation ID line. The UUID regex on line 595 captures the Evaluation ID, not the dispatched job ID. More critically, dispatch_id is never used in the polling loop: the loop queries nomad job status -json "vault-runner" (the parent parameterized job), not the specific child dispatch. The parent job's .State is always "running" (it remains open for future dispatches), so the terminal-state check never fires. The loop will always time out after 5 minutes.
Fix. Extract the dispatched job ID from the "Dispatched Job ID = " line, then poll that child job.
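A minimal sketch of that fix, not the code from dispatcher.sh. The function names parse_dispatched_job_id, fetch_job_status, and wait_for_job are hypothetical, as are the 5-second interval and the assumption that jq is installed; the 300-second default only mirrors the 5-minute timeout mentioned above.

```shell
#!/usr/bin/env bash
# Sketch only: function names and defaults are illustrative assumptions.

# Read `nomad job dispatch -detach` output on stdin and print the child job ID
# taken from the "Dispatched Job ID = ..." line (not the Evaluation ID line).
parse_dispatched_job_id() {
  sed -n 's/^Dispatched Job ID[[:space:]]*=[[:space:]]*//p' | head -n 1
}

# Print the status of one job. Assumes jq is available and that the job-level
# JSON exposes a .Status field, as described in this review.
fetch_job_status() {
  nomad job status -json "$1" | jq -r '.Status // empty'
}

# Poll the child job until it reports "dead" or the timeout expires.
# Usage: wait_for_job <job_id> [timeout_seconds] [interval_seconds]
wait_for_job() {
  local job_id=$1 timeout=${2:-300} interval=${3:-5} elapsed=0 status
  while [ "$elapsed" -lt "$timeout" ]; do
    status=$(fetch_job_status "$job_id")
    if [ "$status" = "dead" ]; then
      printf '%s\n' "$status"
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 1  # timed out without seeing a terminal state
}
```

A caller would pipe the dispatch output through parse_dispatched_job_id, store the result in a variable, and pass that variable (never the literal vault-runner) to wait_for_job.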
Allocation lookup and state checks should also operate on $dispatch_job_id, not on vault-runner.
Confirm done. With the child job ID vault-runner/dispatch-..., nomad job status -json "$dispatch_job_id" returns a single-job JSON object whose .Status transitions to dead on completion, so the loop exits before the timeout.
Advisory: exit-code extraction JSON paths don't match Nomad schema
(Not a separate blocker; matters once the two above are fixed.)
- nomad alloc status -short does not emit exit_code=N format; the grep silently returns empty every time.
- nomad alloc status -json has no .TaskState or .ExitCode top-level fields; the actual path is .TaskStates["runner"].Events[].ExitCode (last event's exit code).
The current fallback to job-level .State as a proxy is safer but loses the actual exit code. Follow-up to fix the JSON paths once the dispatch and polling blockers are resolved.
Verdict
REQUEST_CHANGES — mounts_csv dispatch will be rejected by Nomad; polling loop targets parent job not dispatched child
Reviewed at 9f9abde | AGENTS.md
AI Re-review (round 2)
Previous Findings
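- mounts_csv meta rejected by Nomad → FIXED: -meta mounts_csv=... removed; comment updated
- Polling loop targets the parent job, not the dispatched child → FIXED: child job ID is read from the Dispatched Job ID = line; all polling and final-status fetches use $dispatched_job_id
New Issue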
blocker
dispatcher.sh:689,701-704: Exit-code fallback marks every successful Nomad run as failure.
What's wrong. The exit-code extraction at line 689 reads .TaskStates["runner"].Events[-1].ExitCode from $final_status_json, which comes from nomad job status -json. .TaskStates is an allocation-level field; it does not appear in job-level JSON. The jq path silently returns empty on every call, so alloc_exit_code is always unset and exit_code stays 0.
The fallback (lines 697-705) then fires and hits the case arm that maps dead to failure.
In Nomad, a batch job's .Status transitions to dead on successful completion: dead means terminal, not failed. A job that runs a vault-runner task to exit 0 will have .Status = "dead". The fallback sets exit_code=1, and the dispatcher reports a successful run as a failure every time.
Fix.
- Remove dead from the fallback failure case; only failed maps to exit_code=1.
- Replace echo "$final_status_json" | jq ... with nomad alloc status -json "$alloc_id" | jq -r '.TaskStates["runner"].Events[-1].ExitCode // empty' so the path is actually present in the JSON being queried.
Confirm done. A vault-runner task that exits 0 results in the dispatcher returning exit_code=0 and logging "completed successfully".
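The two fixes can be sketched as below. This is an illustration under stated assumptions, not the dispatcher's actual code: the helper names extract_task_exit_code, status_to_exit_code, and resolve_exit_code are hypothetical, the task name runner and the jq path are taken from this review, and jq is assumed to be installed.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are illustrative assumptions.

# Read allocation-level JSON on stdin (e.g. from `nomad alloc status -json`)
# and print the last event's ExitCode for the given task, or nothing if absent.
extract_task_exit_code() {
  jq -r --arg task "${1:-runner}" '.TaskStates[$task].Events[-1].ExitCode // empty'
}

# Fallback used only when no exit code could be extracted.
# "dead" is the normal terminal state of a finished batch job, so it is NOT a failure.
status_to_exit_code() {
  case "$1" in
    failed) echo 1 ;;
    *)      echo 0 ;;
  esac
}

# Prefer the real exit code from the allocation; otherwise fall back to job status.
# Usage: resolve_exit_code <alloc_exit_code_or_empty> <job_status>
resolve_exit_code() {
  local alloc_exit_code=$1 job_status=$2
  if [ -n "$alloc_exit_code" ]; then
    printf '%s\n' "$alloc_exit_code"
  else
    status_to_exit_code "$job_status"
  fi
}
```

In use, the output of nomad alloc status -json "$alloc_id" would be piped into extract_task_exit_code, and the result passed to resolve_exit_code together with the child job's status. Because jq's // operator only falls through on null or false, an ExitCode of 0 is preserved rather than treated as missing.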
Verdict
REQUEST_CHANGES — dead state treated as failure in fallback; every successful Nomad run returns exit_code=1
Reviewed at 9f94b81 | Previous: 9f9abde | AGENTS.md
AI Re-review (round 3)
Previous Findings
- dispatcher.sh:689 exit-code source wrong (job-level JSON has no TaskStates) → FIXED: extraction now calls nomad alloc status -json "$alloc_id", which carries .TaskStates
- dispatcher.sh:701-704 dead state treated as failure → FIXED: case arm is now failed) only; comment confirms dead = successful terminal state for batch jobs
New Issues
None. The incremental diff is surgical: one source swap, one case-arm removal, one comment addition. No adjacent invariants touched.
Verdict
APPROVE — Both prior blockers closed: dead-state fallback fixed, exit-code source corrected
Reviewed at 9806ed4 | Previous: 9f94b81 | AGENTS.md