fix: [nomad-step-5] S5.4 — dispatcher.sh DISPATCHER_BACKEND=nomad branch (nomad job dispatch) (#991) #997

Merged
dev-qwen merged 3 commits from fix/issue-991 into main 2026-04-18 07:43:29 +00:00
Collaborator

Fixes #991

Changes

Fixes #991 ## Changes
dev-qwen added 1 commit 2026-04-18 07:21:03 +00:00
fix: [nomad-step-5] S5.4 — dispatcher.sh DISPATCHER_BACKEND=nomad branch (nomad job dispatch) (#991)
All checks were successful
ci/woodpecker/push/ci Pipeline was successful
ci/woodpecker/pr/ci Pipeline was successful
9f9abdee82
Collaborator

AI Review

Correctness Claims

Before reading: (1) dispatch passes all three meta keys and vault-runner.hcl accepts them; (2) polling loop tracks the dispatched child job until terminal state; (3) exit-code extraction reads the right JSON field.


Blocker 1 — undeclared mounts_csv meta key will cause dispatch rejection

What's wrong. vault-runner.hcl declares:

parameterized {
  meta_required = ["action_id", "secrets_csv"]
}

No meta_optional list exists. Nomad rejects dispatch with "Unexpected parameterized job metadata keys" when any key not in meta_required or meta_optional is passed. The dispatcher sends -meta mounts_csv=... on every call, so every dispatch will fail at the API level before a job is placed.

The comment at line 578 even acknowledges the inconsistency: "mounts_csv is passed as env var for the nomad task to consume" — yet the code passes it as -meta, not as an env var, and the task config has no reference to NOMAD_META_mounts_csv anyway.

Fix. Remove the -meta mounts_csv=${mounts_csv:-} flag from the dispatch call (mounts_csv handling in the Nomad backend is out of scope for this step, matching how the HCL omits it). If mounts_csv support is required, add meta_optional = ["mounts_csv"] to vault-runner.hcl and wire NOMAD_META_mounts_csv into the task in the same PR.

Confirm done. Dispatch with only action_id and secrets_csv meta succeeds without an "Unexpected" error from Nomad.


Blocker 2 — polling loop targets the parent job, not the dispatched child

What's wrong. nomad job dispatch -detach prints:

Dispatched Job ID = vault-runner/dispatch-<unix>-<random>
Evaluation ID     = <uuid>

The UUID regex on line 595 captures the Evaluation ID, not the dispatched job ID. More critically, dispatch_id is never used in the polling loop — the loop queries nomad job status -json "vault-runner" (the parent parameterized job), not the specific child dispatch. The parent job's .State is always "running" (it remains open for future dispatches), so the terminal-state check never fires. The loop will always time out after 5 minutes.

Fix. Extract the dispatched job ID from the "Dispatched Job ID = " line, then poll that child job:

dispatch_job_id=$(echo "$dispatch_output" | grep -oP '(?<=Dispatched Job ID = )\S+')
# poll: nomad job status -json "$dispatch_job_id"

Allocation lookup and state checks should also operate on $dispatch_job_id, not on vault-runner.

Confirm done. With the child job ID vault-runner/dispatch-..., nomad job status -json "$dispatch_job_id" returns a single-job JSON object whose .Status transitions to dead on completion — the loop exits before the timeout.


Advisory — exit-code extraction JSON paths don't match Nomad schema

(Not a separate blocker; matters once the two above are fixed.)

  • nomad alloc status -short does not emit exit_code=N format — the grep silently returns empty every time.
  • nomad alloc status -json has no .TaskState or .ExitCode top-level fields; the actual path is .TaskStates["runner"].Events[].ExitCode (last event's exit code).

The current fallback to job-level .State as a proxy is safer but loses the actual exit code. Follow-up to fix the JSON paths once the dispatch and polling blockers are resolved.

Verdict

REQUEST_CHANGES — mounts_csv dispatch will be rejected by Nomad; polling loop targets parent job not dispatched child


Reviewed at 9f9abde | AGENTS.md

## AI Review <!-- reviewed: 9f9abdee82705c232c8a42edf37a7b12efa7b216 --> ### Correctness Claims Before reading: (1) dispatch passes all three meta keys and vault-runner.hcl accepts them; (2) polling loop tracks the dispatched child job until terminal state; (3) exit-code extraction reads the right JSON field. --- ### Blocker 1 — undeclared `mounts_csv` meta key will cause dispatch rejection **What's wrong.** `vault-runner.hcl` declares: ```hcl parameterized { meta_required = ["action_id", "secrets_csv"] } ``` No `meta_optional` list exists. Nomad rejects dispatch with *"Unexpected parameterized job metadata keys"* when any key not in `meta_required` or `meta_optional` is passed. The dispatcher sends `-meta mounts_csv=...` on every call, so every dispatch will fail at the API level before a job is placed. The comment at line 578 even acknowledges the inconsistency: *"mounts_csv is passed as env var for the nomad task to consume"* — yet the code passes it as `-meta`, not as an env var, and the task config has no reference to `NOMAD_META_mounts_csv` anyway. **Fix.** Remove the `-meta mounts_csv=${mounts_csv:-}` flag from the dispatch call (mounts_csv handling in the Nomad backend is out of scope for this step, matching how the HCL omits it). If mounts_csv support is required, add `meta_optional = ["mounts_csv"]` to vault-runner.hcl **and** wire `NOMAD_META_mounts_csv` into the task in the same PR. **Confirm done.** Dispatch with only `action_id` and `secrets_csv` meta succeeds without an "Unexpected" error from Nomad. --- ### Blocker 2 — polling loop targets the parent job, not the dispatched child **What's wrong.** `nomad job dispatch -detach` prints: ``` Dispatched Job ID = vault-runner/dispatch-<unix>-<random> Evaluation ID = <uuid> ``` The UUID regex on line 595 captures the **Evaluation ID**, not the dispatched job ID. More critically, `dispatch_id` is never used in the polling loop — the loop queries `nomad job status -json "vault-runner"` (the parent parameterized job), not the specific child dispatch. The parent job's `.State` is always `"running"` (it remains open for future dispatches), so the terminal-state check never fires. The loop will always time out after 5 minutes. **Fix.** Extract the dispatched job ID from the "Dispatched Job ID = " line, then poll that child job: ```bash dispatch_job_id=$(echo "$dispatch_output" | grep -oP '(?<=Dispatched Job ID = )\S+') # poll: nomad job status -json "$dispatch_job_id" ``` Allocation lookup and state checks should also operate on `$dispatch_job_id`, not on `vault-runner`. **Confirm done.** With the child job ID `vault-runner/dispatch-...`, `nomad job status -json "$dispatch_job_id"` returns a single-job JSON object whose `.Status` transitions to `dead` on completion — the loop exits before the timeout. --- ### Advisory — exit-code extraction JSON paths don't match Nomad schema (Not a separate blocker; matters once the two above are fixed.) - `nomad alloc status -short` does not emit `exit_code=N` format — the grep silently returns empty every time. - `nomad alloc status -json` has no `.TaskState` or `.ExitCode` top-level fields; the actual path is `.TaskStates["runner"].Events[].ExitCode` (last event's exit code). The current fallback to job-level `.State` as a proxy is safer but loses the actual exit code. Follow-up to fix the JSON paths once the dispatch and polling blockers are resolved. ### Verdict **REQUEST_CHANGES** — mounts_csv dispatch will be rejected by Nomad; polling loop targets parent job not dispatched child --- *Reviewed at `9f9abde` | [AGENTS.md](AGENTS.md)*
review-bot requested changes 2026-04-18 07:24:08 +00:00
Dismissed
review-bot left a comment
Collaborator

AI Review: REQUEST_CHANGES — mounts_csv dispatch will be rejected by Nomad; polling loop targets parent job not dispatched child

AI Review: **REQUEST_CHANGES** — mounts_csv dispatch will be rejected by Nomad; polling loop targets parent job not dispatched child
dev-qwen added 1 commit 2026-04-18 07:29:01 +00:00
fix: [nomad-step-5] S5.4 — dispatcher.sh DISPATCHER_BACKEND=nomad branch (nomad job dispatch) (#991)
All checks were successful
ci/woodpecker/push/ci Pipeline was successful
ci/woodpecker/pr/ci Pipeline was successful
9f94b818a3
Collaborator

AI Re-review (round 2)

Previous Findings

  • Blocker 1 — undeclared mounts_csv meta rejected by Nomad → FIXED: -meta mounts_csv=... removed; comment updated
  • Blocker 2 — polling targeted parent job, not dispatched child → FIXED: job ID extraction now matches Dispatched Job ID = line; all polling and final-status fetches use $dispatched_job_id

New Issue

  • blocker dispatcher.sh:689,701-704: Exit-code fallback marks every successful Nomad run as failure.

    What's wrong. The exit-code extraction at line 689 reads .TaskStates["runner"].Events[-1].ExitCode from $final_status_json, which comes from nomad job status -json. .TaskStates is an allocation-level field — it does not appear in job-level JSON. The JQ path silently returns empty on every call, so alloc_exit_code is always unset and exit_code stays 0.

    The fallback (lines 697-705) then fires and hits this branch:

    case "$final_state" in
      failed|dead)
        exit_code=1
    

    In Nomad, a batch job's .Status transitions to dead on successful completion — dead means terminal, not failed. A job that runs a vault-runner task to exit 0 will have .Status = "dead". The fallback sets exit_code=1, and the dispatcher reports a successful run as a failure every time.

    Fix.

    1. Remove dead from the fallback failure case — only failed maps to exit_code=1:
      case "$final_state" in
        failed)
          exit_code=1
          ;;
      esac
      
    2. Fix the exit-code source: replace echo "$final_status_json" | jq ... with nomad alloc status -json "$alloc_id" | jq -r '.TaskStates["runner"].Events[-1].ExitCode // empty' so the path is actually present in the JSON being queried.

    Confirm done. A vault-runner task that exits 0 results in the dispatcher returning exit_code=0 and logging "completed successfully".

Verdict

REQUEST_CHANGES — dead state treated as failure in fallback; every successful Nomad run returns exit_code=1


Reviewed at 9f94b81 | Previous: 9f9abde | AGENTS.md

## AI Re-review (round 2) <!-- reviewed: 9f94b818a37320bd8b60270ec0adfd811c7b692a --> ### Previous Findings - **Blocker 1** — undeclared `mounts_csv` meta rejected by Nomad → **FIXED**: `-meta mounts_csv=...` removed; comment updated - **Blocker 2** — polling targeted parent job, not dispatched child → **FIXED**: job ID extraction now matches `Dispatched Job ID = ` line; all polling and final-status fetches use `$dispatched_job_id` --- ### New Issue - **blocker** `dispatcher.sh:689,701-704`: Exit-code fallback marks every successful Nomad run as failure. **What's wrong.** The exit-code extraction at line 689 reads `.TaskStates["runner"].Events[-1].ExitCode` from `$final_status_json`, which comes from `nomad job status -json`. `.TaskStates` is an allocation-level field — it does not appear in job-level JSON. The JQ path silently returns empty on every call, so `alloc_exit_code` is always unset and `exit_code` stays 0. The fallback (lines 697-705) then fires and hits this branch: ```bash case "$final_state" in failed|dead) exit_code=1 ``` In Nomad, a batch job's `.Status` transitions to `dead` on **successful** completion — `dead` means terminal, not failed. A job that runs a vault-runner task to exit 0 will have `.Status = "dead"`. The fallback sets `exit_code=1`, and the dispatcher reports a successful run as a failure every time. **Fix.** 1. Remove `dead` from the fallback failure case — only `failed` maps to exit_code=1: ```bash case "$final_state" in failed) exit_code=1 ;; esac ``` 2. Fix the exit-code source: replace `echo "$final_status_json" | jq ...` with `nomad alloc status -json "$alloc_id" | jq -r '.TaskStates["runner"].Events[-1].ExitCode // empty'` so the path is actually present in the JSON being queried. **Confirm done.** A vault-runner task that exits 0 results in the dispatcher returning exit_code=0 and logging "completed successfully". ### Verdict **REQUEST_CHANGES** — dead state treated as failure in fallback; every successful Nomad run returns exit_code=1 --- *Reviewed at `9f94b81` | Previous: `9f9abde` | [AGENTS.md](AGENTS.md)*
review-bot requested changes 2026-04-18 07:36:50 +00:00
Dismissed
review-bot left a comment
Collaborator

AI Re-review (round 2): REQUEST_CHANGES — dead state treated as failure in fallback; every successful Nomad run returns exit_code=1

AI Re-review (round 2): **REQUEST_CHANGES** — dead state treated as failure in fallback; every successful Nomad run returns exit_code=1
dev-qwen added 1 commit 2026-04-18 07:41:11 +00:00
fix: [nomad-step-5] S5.4 — dispatcher.sh nomad exit code extraction (dead != failure) (#991)
All checks were successful
ci/woodpecker/push/ci Pipeline was successful
ci/woodpecker/pr/ci Pipeline was successful
9806ed40df
Collaborator

AI Re-review (round 3)

Previous Findings

  • blocker dispatcher.sh:689 exit-code source wrong (job-level JSON has no TaskStates) → FIXED: extraction now calls nomad alloc status -json "$alloc_id" which carries .TaskStates
  • blocker dispatcher.sh:701-704 dead state treated as failure → FIXED: case arm is now failed) only; comment confirms dead = successful terminal state for batch jobs

New Issues

None. The incremental diff is surgical: one source swap, one case-arm removal, one comment addition. No adjacent invariants touched.

Verdict

APPROVE — Both prior blockers closed: dead-state fallback fixed, exit-code source corrected


Reviewed at 9806ed4 | Previous: 9f94b81 | AGENTS.md

## AI Re-review (round 3) <!-- reviewed: 9806ed40dfda7e996c73350fbb16e8a49533e026 --> ### Previous Findings - **blocker** `dispatcher.sh:689` exit-code source wrong (job-level JSON has no TaskStates) → **FIXED**: extraction now calls `nomad alloc status -json "$alloc_id"` which carries `.TaskStates` - **blocker** `dispatcher.sh:701-704` `dead` state treated as failure → **FIXED**: case arm is now `failed)` only; comment confirms `dead` = successful terminal state for batch jobs ### New Issues None. The incremental diff is surgical: one source swap, one case-arm removal, one comment addition. No adjacent invariants touched. ### Verdict **APPROVE** — Both prior blockers closed: dead-state fallback fixed, exit-code source corrected --- *Reviewed at `9806ed4` | Previous: `9f94b81` | [AGENTS.md](AGENTS.md)*
review-bot approved these changes 2026-04-18 07:42:51 +00:00
review-bot left a comment
Collaborator

AI Re-review (round 3): APPROVE — Both prior blockers closed: dead-state fallback fixed, exit-code source corrected

AI Re-review (round 3): **APPROVE** — Both prior blockers closed: dead-state fallback fixed, exit-code source corrected
dev-qwen merged commit 30bc21c650 into main 2026-04-18 07:43:29 +00:00
dev-qwen deleted branch fix/issue-991 2026-04-18 07:43:30 +00:00
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: disinto-admin/disinto#997
No description provided.