fix: [nomad-step-1] S1.2 — add lib/init/nomad/deploy.sh (dependency-ordered nomad job run + wait) (#841) #854
Reference: disinto-admin/disinto#854
Fixes #841
Changes
AI Review
Summary
This PR adds a script for dependency-ordered Nomad job deployment with a poll-until-running wait helper. The structure and approach are sound, but there are three blocking correctness bugs.
Blocking Bugs
critical

`deploy.sh:33`: `REPO_ROOT` resolves to `lib/`, not the repo root. `SCRIPT_ROOT` is `lib/init/nomad/`, and `${SCRIPT_ROOT}/../..` traverses two levels: `lib/init/nomad → lib/init → lib`. The repo root is three levels up — compare `cluster-up.sh:42`, which correctly uses `${SCRIPT_DIR}/../../..`. As written, every jobspec lookup becomes `lib/nomad/jobs/<name>.hcl` (which doesn't exist), and the script dies immediately. Fix:

`REPO_ROOT="${REPO_ROOT:-$(cd "${SCRIPT_ROOT}/../../.." && pwd)}"`
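The depth mismatch is easy to verify in isolation. A minimal sketch, using a throwaway directory tree that mimics the layout described above (not the actual repo):

```shell
# Reproduce the path bug against a scratch tree with the same nesting.
tmp=$(mktemp -d)
mkdir -p "$tmp/repo/lib/init/nomad"
SCRIPT_ROOT="$tmp/repo/lib/init/nomad"

# Buggy: two levels up stops at lib/.
buggy_root=$(cd "$SCRIPT_ROOT/../.." && pwd)

# Fixed: three levels up reaches the repo root.
fixed_root=$(cd "$SCRIPT_ROOT/../../.." && pwd)

echo "buggy=${buggy_root#"$tmp/"} fixed=${fixed_root#"$tmp/"}"   # buggy=repo/lib fixed=repo
rm -rf "$tmp"
```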
critical

`deploy.sh:165-167`: Idempotency check always misses. The text output of `nomad job status <name>` opens with `ID = <name>`, not the status line. `head -1` captures that header line, which never contains "running", so the `grep -qi "running"` test is always false and the script re-submits already-running jobs on every invocation. Fix: remove the `| head -1` pipe, or switch to `nomad job status -json "" | jq -r '.Status'`.
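To see why the `head -1` pipe can never match, the text output can be simulated with a stub. The exact header layout below is an assumption based on the review's description, not real `nomad` output:

```shell
# Simulated text output of `nomad job status <name>` (hypothetical layout:
# an "ID = ..." header line precedes the "Status = ..." line).
fake_status_output() {
  printf 'ID            = forgejo\n'
  printf 'Name          = forgejo\n'
  printf 'Status        = running\n'
}

# Buggy check: head -1 only ever sees the ID header, so grep never matches.
if fake_status_output | head -1 | grep -qi "running"; then buggy=match; else buggy=miss; fi

# Fixed check: drop the head -1 pipe so grep scans the whole output.
if fake_status_output | grep -qi "running"; then fixed=match; else fixed=miss; fi

echo "buggy=$buggy fixed=$fixed"   # buggy=miss fixed=match
```

The JSON route (`-json` plus `jq -r '.Status'`) is still preferable, since text-grepping can false-positive on other lines.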
high

`deploy.sh:104-114`: A `dead`/`failed` terminal state returns 0 (success). In `_wait_job_running`, the `complete|dead|failed` branch logs "reached terminal state" and does `return 0` regardless of which terminal state was hit. For a service job like forgejo, `dead` or `failed` means the container crashed; the caller receives success, the main loop continues, and the failure is silent. Fix: `return 0` only for `complete`; `return 1` for `dead` and `failed`.
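A sketch of the corrected branching, with `handle_terminal_state` as a hypothetical stand-in for the logic inside `_wait_job_running` (the function name and the non-terminal return code are illustrative, not from the PR):

```shell
# Only "complete" is a successful terminal state; dead/failed must fail.
handle_terminal_state() {
  case "$1" in
    complete)    return 0 ;;  # batch job finished cleanly
    dead|failed) return 1 ;;  # service crashed: propagate the failure
    *)           return 2 ;;  # not terminal; caller keeps polling
  esac
}

handle_terminal_state complete && echo "complete -> success"
handle_terminal_state dead     || echo "dead -> failure"
handle_terminal_state failed   || echo "failed -> failure"
```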
Non-Blocking Issues

medium

`deploy.sh:90-96`: The `.Evaluations[].Allocations[]?` jq path is wrong for `nomad job status -json`. The Nomad API does not nest allocations under evaluations in the job-status response; allocations are a separate top-level array. The `|| allocs_running=0` / `|| allocs_total=0` fallbacks silence the error, so there's no crash, but the partial-alloc detection logic is entirely dead. This affects both the `allocs_running` and `allocs_total` queries. Fix (follow-up ticket): use `nomad job allocs -json ""` to fetch allocations separately, or use `nomad job status -json "" | jq '.Allocations'`.
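The path bug can be demonstrated against a miniature of the job-status JSON. The shape here is an assumption (allocations as a top-level array, not nested under `.Evaluations`), and the snippet assumes `jq` is installed:

```shell
# Hypothetical miniature of the `nomad job status -json` response shape.
status_json='{"Status":"running",
  "Evaluations":[{"ID":"eval-1"}],
  "Allocations":[{"ID":"a1","ClientStatus":"running"},
                 {"ID":"a2","ClientStatus":"pending"}]}'

# Broken path: .Evaluations[].Allocations does not exist, so the stream is
# empty and the count is always 0 -- the detection logic is dead.
broken=$(printf '%s' "$status_json" | jq '[.Evaluations[].Allocations[]?] | length')

# Top-level path: finds both allocations.
fixed=$(printf '%s' "$status_json" | jq '.Allocations | length')

echo "broken=$broken fixed=$fixed"   # broken=0 fixed=2
```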
low

`deploy.sh:93-96`: The `allocs_total` jq expression counts object keys, not allocations. Even if the path were correct, `[.Evaluations[].Allocations[]? | length] | add` applies `length` to each allocation object (yielding its key count, not 1) and then sums them. It should be `[.Evaluations[].Allocations[]?] | length`. Moot given the path bug, but worth noting for the fix.
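The counting bug is independent of the path bug and reproducible on any small array of objects, again assuming `jq` is available:

```shell
# Two allocation objects, each with two keys.
allocs='[{"ID":"a1","ClientStatus":"running"},{"ID":"a2","ClientStatus":"pending"}]'

# Buggy expression shape: length of each object is its key count (2), summed -> 4.
buggy=$(printf '%s' "$allocs" | jq '[.[] | length] | add')

# Corrected shape: collect the elements first, then count them -> 2.
correct=$(printf '%s' "$allocs" | jq '[.[]] | length')

echo "buggy=$buggy correct=$correct"   # buggy=4 correct=2
```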
Verdict

REQUEST_CHANGES — Three blocking correctness bugs: wrong REPO_ROOT depth, broken idempotency check, and terminal failure silently treated as success.
Reviewed at `fce4d83` | AGENTS.md
fce4d83176 to 6734887a0a

AI Re-review (round 2)
Previous Findings
REPO_ROOT depth wrong () → FIXED: changed `${SCRIPT_ROOT}/../..` to `${SCRIPT_ROOT}/../../..` — three levels up from `lib/init/nomad/` correctly resolves to the repo root.

Idempotency check always misses (`deploy.sh:165-167`) → FIXED: replaced `nomad job status | head -1 | grep -qi running` with `nomad job status -json | jq -r '.Status'` and an exact string comparison `= "running"`. Already-running jobs are now correctly detected.

`dead`/`failed` silently treated as success (`deploy.sh:104-114`) → FIXED: split `complete|dead|failed` into separate branches — `complete` returns 0, `dead|failed` returns 1. Service jobs that crash now correctly fail the deploy.

Wrong jq path for alloc detection (medium, non-blocking) → PARTIAL: the dead alloc-detection logic was removed entirely (correct), but the timeout diagnostic block at `deploy.sh:122-123` still uses `.Evaluations[].Allocations[]?.ID`, which is the same wrong path. The `|| alloc_ids=""` fallback prevents any failure — this just means alloc logs won't print on timeout. Pre-existing, diagnostic-only, no impact on correctness. Filed as tech-debt if not already tracked.

Assessment
The script is now correct for its stated purpose: dependency-ordered job submission with proper idempotency and failure propagation. The three blocking correctness bugs are closed.
Verdict
APPROVE — All three previous blockers fixed; remaining jq path issue in timeout diagnostics is non-critical with fallback
Reviewed at `6734887` | Previous: `fce4d83` | AGENTS.md