fix: [nomad-step-1] deploy.sh-fix — poll deployment status not alloc status; bump timeout 120→240s (#878) #885
Fixes #878
Changes
AI Review
Summary
The PR's intent is correct: polling `nomad deployment status` is the right signal for health, rather than job-level allocation status. However, the implementation has a critical jq bug that makes the core loop non-functional.

Blocking Issues
critical

`deploy.sh:102`: `jq -r '.[0].Status'` is applied to the output of `nomad deployment status -json <deployment-id>`, which returns a single JSON object, not an array. jq cannot index an object with a number, so `.[0]` errors and `status` is always empty. The `case` statement falls to `*` on every iteration, and the function loops until timeout on every deployment: success and failure are never detected.

Fix: change `'.[0].Status'` to `'.Status'` on line 102.
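The failure mode is easy to reproduce outside Nomad with a hand-written payload (a minimal sketch; the JSON values below are made up, not real Nomad output):

```shell
# An object shaped like `nomad deployment status -json` output
# (hypothetical values).
json='{"ID":"d-123","Status":"successful"}'

# Buggy form: jq cannot index an object with a number, so it errors on
# stderr and the command substitution captures an empty string.
# (`|| true` keeps the demo alive under `set -e`.)
buggy=$(printf '%s' "$json" | jq -r '.[0].Status' 2>/dev/null || true)

# Fixed form: address the field on the object directly.
fixed=$(printf '%s' "$json" | jq -r '.Status')

printf "buggy='%s' fixed='%s'\n" "$buggy" "$fixed"   # buggy='' fixed='successful'
```

The bug hides easily because the command substitution discards jq's non-zero exit status; nothing in the loop ever checks it.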
high

`deploy.sh:85-88`: The deployment ID is fetched exactly once at function entry, immediately after `nomad job run -detach`. If the deployment hasn't yet propagated to Nomad's API (a transient delay), `deployment_id` is empty and the function returns 1 immediately with "no deployment found": a hard failure rather than a retry. This should loop with a short sleep until the deployment ID appears, or at minimum retry a few times before giving up.
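A sketch of the suggested retry, with assumed function names and intervals (not the PR's actual code):

```shell
# Poll for the deployment ID instead of reading it once, giving Nomad's
# API time to propagate the deployment created by `nomad job run -detach`.
get_deployment_id() {
  job_name="$1"
  attempt=0
  max_attempts=12                              # 12 x 5s = up to 60s
  while [ "$attempt" -lt "$max_attempts" ]; do
    deployment_id=$(nomad job deployments -json "$job_name" 2>/dev/null \
      | jq -r '.[0].ID // empty')
    if [ -n "$deployment_id" ]; then
      printf '%s\n' "$deployment_id"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep 5
  done
  echo "no deployment found for ${job_name}" >&2
  return 1
}
```

`nomad job deployments -json` does return an array (newest first), so `.[0]` is correct here, unlike in the status poll.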
Minor Issues

low

`deploy.sh:119`: The alloc-ID extraction from the deployment JSON uses `'.[0].AllocStatus.AllocsNotYetRunning'`. Two problems: (1) the same `.[0]` issue on a single object; (2) `AllocStatus.AllocsNotYetRunning` does not exist in Nomad's deployment JSON schema: the deployment object has task group summaries under `TaskGroups`, not a flat alloc list. This path always returns empty, so diagnostics always fall through to the job-status fallback.

low

`deploy.sh:124,155`: `jq -r '.Evaluations[].Allocations[]?.ID'` is applied to `nomad job status -json` output, but allocations in the job status JSON are at `.Allocations[]` (a top-level array), not nested under evaluations. This also silently yields nothing, meaning the failure/timeout log dump produces no alloc output.
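The path difference is easy to check against a hand-built payload (the shape is assumed for illustration; field values are made up):

```shell
# Job-status-like JSON: Allocations is a top-level array, and the
# Evaluations entries carry no nested alloc list.
json='{"Allocations":[{"ID":"a-1"},{"ID":"a-2"}],"Evaluations":[{"ID":"e-1"}]}'

# Buggy path: each evaluation has no .Allocations, so this yields nothing
# (the trailing `?` suppresses the iterate-over-null error).
printf '%s' "$json" | jq -r '.Evaluations[].Allocations[]?.ID'

# Corrected path: read the top-level array.
printf '%s' "$json" | jq -r '.Allocations[]?.ID'   # prints a-1 and a-2
```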
Non-blocking Notes

- The idempotency check compares `current_status = "running"` (job-level status) while the log says "already healthy". This is intentional and correct: job status "running" is the right idempotency signal; no change needed.
- The per-job timeout override (`JOB_READY_TIMEOUT_<JOBNAME>`) is a clean, useful addition.
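One plausible POSIX-sh shape for the `JOB_READY_TIMEOUT_<JOBNAME>` lookup (the PR's actual mechanism is not shown in this review; the function name, naming convention, and 240s default are all assumptions):

```shell
# Resolve a per-job readiness timeout from JOB_READY_TIMEOUT_<JOBNAME>,
# falling back to a default. Job names are upper-cased and dashes become
# underscores to form a valid variable name (assumed convention).
job_ready_timeout() {
  job="$1"
  default="${2:-240}"
  suffix=$(printf '%s' "$job" | tr 'a-z-' 'A-Z_')
  eval "printf '%s\n' \"\${JOB_READY_TIMEOUT_${suffix}:-${default}}\""
}

JOB_READY_TIMEOUT_API_GATEWAY=300
job_ready_timeout api-gateway    # prints 300 (override found)
job_ready_timeout other-job      # prints 240 (default)
```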
Verdict

REQUEST_CHANGES — `jq '.[0].Status'` on a single-object response never yields a status (jq cannot index an object with a number) — every deployment will time out instead of detecting success/failure
Reviewed at dee05d2 | AGENTS.md
AI Re-review (round 2)
Previous Findings
- `deploy.sh:108`: `jq -r '.[0].Status'` on a single deployment object → no status ever extracted → FIXED: changed to `.Status` at line 111 ✓
- `deploy.sh:85-88`: deployment ID fetched once with no retry → FIXED: retry loop with `max_retries=12`, `sleep 5` per attempt (up to 60s), matching the per-job timeout model ✓
- `deploy.sh:119`: `.[0].AllocStatus.AllocsNotYetRunning` wrong path → FIXED: removed entirely; the failure path now uses `nomad job status -json | jq -r '.Allocations[]?.ID // empty'` at lines 128-129 ✓
- `deploy.sh:155`: `.Evaluations[].Allocations[]?.ID` wrong path → FIXED: changed to `.Allocations[]?.ID` at line 159 ✓

Verification
Checked the full file to confirm:
- `nomad job deployments -json` returns an array → `.[0].ID` is correct for the latest deployment
- `nomad deployment status -json <id>` returns a single object → `.Status` is now correctly applied
- `nomad job status -json <name>` has top-level `.Allocations[]` → both failure and timeout log paths are now correct
- The idempotency check (`current_status = "running"`) at line 203 remains intentional and correct (job-level status, not deployment-level)
- The per-job timeout override `JOB_READY_TIMEOUT_<JOBNAME>` is clean and correct
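Putting the verified pieces together, the corrected polling loop plausibly reduces to something like this (a sketch; the function name, 5-second interval, and exact terminal states are assumptions, not the file's code):

```shell
# Poll `nomad deployment status -json` until a terminal state or timeout.
wait_for_deployment() {
  deployment_id="$1"
  timeout="${2:-240}"
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    status=$(nomad deployment status -json "$deployment_id" 2>/dev/null \
      | jq -r '.Status')
    case "$status" in
      successful)       return 0 ;;  # deployment healthy
      failed|cancelled) return 1 ;;  # terminal failure
      *)                sleep 5; elapsed=$((elapsed + 5)) ;;
    esac
  done
  echo "timed out waiting for deployment ${deployment_id}" >&2
  return 1
}
```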
Verdict

APPROVE — All four previous blockers closed; no new issues introduced
Reviewed at 3734920 | Previous: dee05d2 | AGENTS.md