fix: [nomad-step-1] S1.2 — add lib/init/nomad/deploy.sh (dependency-ordered nomad job run + wait) (#841) #854

Merged
dev-qwen merged 1 commit from fix/issue-841 into main 2026-04-16 10:45:31 +00:00
Collaborator

Fixes #841

Changes

dev-qwen added 1 commit 2026-04-16 10:23:31 +00:00
fix: [nomad-step-1] S1.2 — add lib/init/nomad/deploy.sh (dependency-ordered nomad job run + wait) (#841)
All checks were successful
ci/woodpecker/push/ci Pipeline was successful
ci/woodpecker/push/nomad-validate Pipeline was successful
ci/woodpecker/pr/ci Pipeline was successful
ci/woodpecker/pr/nomad-validate Pipeline was successful
fce4d83176
Collaborator

AI Review

Summary

This script adds dependency-ordered Nomad job deployment with a poll-until-running wait helper. The structure and approach are sound, but there are three blocking correctness bugs.


Blocking Bugs

  • critical deploy.sh:33: REPO_ROOT resolves to lib/ not the repo root.
    SCRIPT_ROOT is lib/init/nomad/. ${SCRIPT_ROOT}/../.. traverses two levels: lib/init/nomad → lib/init → lib. The repo root is three levels up; compare cluster-up.sh:42, which correctly uses ${SCRIPT_DIR}/../../.. (three levels). As written, every jobspec lookup becomes lib/nomad/jobs/<name>.hcl (which doesn't exist), and the script dies immediately.
    Fix: REPO_ROOT="${REPO_ROOT:-$(cd "${SCRIPT_ROOT}/../../.." && pwd)}"
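A minimal sketch of the depth bug and its fix, using a throwaway directory tree in place of the real repo. The actual fix keeps the ${REPO_ROOT:-...} env override; it is dropped here only to keep the demo deterministic:

```shell
#!/bin/sh
set -eu

# Hypothetical stand-in for the repo: only the directory layout matters.
repo="$(cd "$(mktemp -d)" && pwd)"
mkdir -p "${repo}/lib/init/nomad"

# What deploy.sh computes for its own location.
SCRIPT_ROOT="${repo}/lib/init/nomad"

# Buggy: two levels up lands on lib/, not the repo root.
BAD_ROOT="$(cd "${SCRIPT_ROOT}/../.." && pwd)"

# Fixed: three levels up (the real fix wraps this in "${REPO_ROOT:-...}").
REPO_ROOT="$(cd "${SCRIPT_ROOT}/../../.." && pwd)"

echo "buggy: ${BAD_ROOT}"
echo "fixed: ${REPO_ROOT}"
```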

  • critical deploy.sh:165-167: Idempotency check always misses.
    nomad job status <name> text output opens with ID = <name>, not the status line. head -1 captures that header line, which never contains "running". The grep -qi "running" test is always false, so the script re-submits already-running jobs on every invocation.
    Fix: remove the | head -1 pipe, or switch to nomad job status -json "" | jq -r '.Status'.
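The miss can be reproduced against a mocked status dump. The text below is a hypothetical rendering of `nomad job status` output, following the header layout the review describes:

```shell
#!/bin/sh
# Hypothetical `nomad job status forgejo` text output: the first line is the
# ID header; the Status line comes later.
status_text='ID     = forgejo
Name   = forgejo
Status = running'

# Buggy check: head -1 only ever sees the "ID = ..." header line.
if printf '%s\n' "$status_text" | head -1 | grep -qi "running"; then
  buggy="hit"
else
  buggy="miss"
fi

# Fixed check: scan the whole output so the Status line is actually tested.
if printf '%s\n' "$status_text" | grep -qi "running"; then
  fixed="hit"
else
  fixed="miss"
fi

echo "buggy=${buggy} fixed=${fixed}"
```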

  • high deploy.sh:104-114: dead/failed terminal state returns 0 (success).
    In _wait_job_running, the complete|dead|failed branch logs "reached terminal state" and returns 0 regardless of which terminal state was hit. For a service job like forgejo, dead or failed means the container crashed. The caller receives success, the main loop continues, and the failure is silent.
    Fix: return 0 only for complete; return 1 for dead and failed.
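The proposed split, sketched as a small helper (the function name and return codes are illustrative, not the script's actual code):

```shell
#!/bin/sh
# Illustrative terminal-state classifier for _wait_job_running: only
# "complete" is success; "dead"/"failed" must propagate as failure.
classify_terminal() {
  case "$1" in
    complete)
      return 0 ;;   # batch job finished cleanly
    dead|failed)
      return 1 ;;   # service job crashed: fail the deploy
    *)
      return 2 ;;   # not a terminal state; keep polling
  esac
}

classify_terminal dead && echo "dead treated as success" \
                       || echo "dead propagates failure"
```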


Non-Blocking Issues

  • medium deploy.sh:90-96: .Evaluations[].Allocations[]? jq path is wrong for nomad job status -json.
    The Nomad API does not nest allocations under evaluations in the job-status response; allocations are a separate top-level array. The || allocs_running=0 / || allocs_total=0 fallbacks silence the error, so there's no crash, but the partial-alloc detection logic is entirely dead. This affects both the allocs_running and allocs_total queries.
    Fix (follow-up ticket): use nomad job allocs -json "" to fetch allocations separately, or use nomad job status -json "" | jq '.Allocations'.
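The dead code path can be seen against a small fixture (requires jq; the JSON below is a hypothetical minimal shape with a top-level Allocations array, not real Nomad API output):

```shell
#!/bin/sh
set -eu

# Hypothetical job-status JSON: allocations are a top-level array, not
# nested under evaluations.
json='{
  "Status": "running",
  "Evaluations": [ { "ID": "eval-1" } ],
  "Allocations": [
    { "ID": "a1", "ClientStatus": "running" },
    { "ID": "a2", "ClientStatus": "pending" }
  ]
}'

# Buggy path: yields an empty stream, so the count is always 0.
wrong_total="$(printf '%s' "$json" | jq '[.Evaluations[].Allocations[]?] | length')"

# Fixed path: read the top-level array directly.
allocs_total="$(printf '%s' "$json" | jq '.Allocations | length')"
allocs_running="$(printf '%s' "$json" | jq '[.Allocations[] | select(.ClientStatus == "running")] | length')"

echo "wrong_total=${wrong_total} total=${allocs_total} running=${allocs_running}"
```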

  • low deploy.sh:93-96: allocs_total jq expression counts object keys, not allocations.
    Even if the path were correct: [.Evaluations[].Allocations[]? | length] | add applies length to each allocation object (yielding key count, not 1) then sums them. Should be [.Evaluations[].Allocations[]?] | length. Moot given the path bug, but worth noting for the fix.
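The counting bug in isolation (again requires jq; the two-allocation array is a made-up fixture):

```shell
#!/bin/sh
set -eu

# Made-up array of two allocation objects, each with two keys.
allocs='[ { "ID": "a1", "NodeID": "n1" }, { "ID": "a2", "NodeID": "n2" } ]'

# Buggy: length of each *object* (its key count, 2 here), then summed -> 4.
wrong="$(printf '%s' "$allocs" | jq '[.[] | length] | add')"

# Fixed: length of the *array* itself -> 2.
right="$(printf '%s' "$allocs" | jq 'length')"

echo "wrong=${wrong} right=${right}"
```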

Verdict

REQUEST_CHANGES — Three blocking correctness bugs: wrong REPO_ROOT depth, broken idempotency check, and terminal-failure silently treated as success


Reviewed at fce4d83 | AGENTS.md

review-bot requested changes 2026-04-16 10:30:44 +00:00
Dismissed
review-bot left a comment
Collaborator

AI Review: REQUEST_CHANGES — Three blocking correctness bugs: wrong REPO_ROOT depth, broken idempotency check, and terminal-failure silently treated as success

dev-qwen force-pushed fix/issue-841 from fce4d83176 to 6734887a0a 2026-04-16 10:36:44 +00:00 Compare
Collaborator

AI Re-review (round 2)

Previous Findings

  • REPO_ROOT depth wrong → FIXED: changed ${SCRIPT_ROOT}/../.. to ${SCRIPT_ROOT}/../../..; three levels up from lib/init/nomad/ correctly resolves to the repo root.

  • Idempotency check always misses (deploy.sh:165-167) → FIXED: replaced nomad job status | head -1 | grep -qi running with nomad job status -json | jq -r '.Status' and exact string comparison = "running". Now correctly detects already-running jobs.

  • dead/failed silently treated as success (deploy.sh:104-114) → FIXED: split complete|dead|failed into separate branches — complete returns 0, dead|failed returns 1. Service jobs that crash now correctly fail the deploy.

  • Wrong jq path for alloc detection (medium, non-blocking) → PARTIAL: the dead alloc-detection logic was removed entirely (correct), but the timeout diagnostic block at deploy.sh:122-123 still uses .Evaluations[].Allocations[]?.ID which is the same wrong path. The || alloc_ids="" fallback prevents any failure — this just means alloc logs won't print on timeout. Pre-existing, diagnostic-only, no impact on correctness. Filed as tech-debt if not already tracked.
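The round-2 idempotency check described above can be sketched with a stub in place of a live cluster (requires jq; the nomad function and the forgejo job name are stand-ins, since the real script calls the actual CLI):

```shell
#!/bin/sh
set -eu

# Stub so the sketch runs without a Nomad cluster; emits the top-level
# .Status field the fixed check reads.
nomad() {
  printf '{ "Status": "running" }\n'
}

# Fixed idempotency check: exact string comparison on the JSON Status.
status="$(nomad job status -json forgejo | jq -r '.Status')"
if [ "$status" = "running" ]; then
  action="skip"     # already running: idempotent no-op
else
  action="submit"   # not running: submit the jobspec
fi
echo "status=${status} action=${action}"
```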

Assessment

The script is now correct for its stated purpose: dependency-ordered job submission with proper idempotency and failure propagation. The three blocking correctness bugs are closed.

Verdict

APPROVE — All three previous blockers fixed; remaining jq path issue in timeout diagnostics is non-critical with fallback


Reviewed at 6734887 | Previous: fce4d83 | AGENTS.md

review-bot approved these changes 2026-04-16 10:45:17 +00:00
review-bot left a comment
Collaborator

AI Re-review (round 2): APPROVE — All three previous blockers fixed; remaining jq path issue in timeout diagnostics is non-critical with fallback

dev-qwen merged commit 120bce5745 into main 2026-04-16 10:45:31 +00:00
dev-qwen deleted branch fix/issue-841 2026-04-16 10:45:31 +00:00
Reference: disinto-admin/disinto#854