fix: [nomad-step-1] deploy.sh-fix — poll deployment status not alloc status; bump timeout 120→240s (#878) #885
Fixes #878
Changes
AI Review
Summary
The PR's intent is correct: polling `nomad deployment status` is the right signal for health, rather than job-level allocation status. However, the implementation has a critical jq bug that makes the core loop non-functional.

Blocking Issues
critical

`deploy.sh:102`: `jq -r '.[0].Status'` is applied to the output of `nomad deployment status -json <deployment-id>`, which returns a single JSON object, not an array. jq cannot index an object with a number, so `.[0]` errors and `status` is always empty. The `case` statement falls to `*` on every iteration, and the function loops until timeout on every deployment: success and failure are never detected.

Fix: change `'.[0].Status'` to `'.Status'` on line 102.
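The failure mode is easy to reproduce outside Nomad with a hand-written payload (a minimal sketch; the JSON values below are made up, not real Nomad output):

```shell
# An object shaped like `nomad deployment status -json` output
# (hypothetical values).
json='{"ID":"d-123","Status":"successful"}'

# Buggy form: jq cannot index an object with a number, so it errors on
# stderr and the command substitution captures an empty string.
# (`|| true` keeps the demo alive under `set -e`.)
buggy=$(printf '%s' "$json" | jq -r '.[0].Status' 2>/dev/null || true)

# Fixed form: address the field on the object directly.
fixed=$(printf '%s' "$json" | jq -r '.Status')

printf "buggy='%s' fixed='%s'\n" "$buggy" "$fixed"   # buggy='' fixed='successful'
```

The bug hides easily because the command substitution discards jq's non-zero exit status; nothing in the loop ever checks it.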
high

`deploy.sh:85-88`: The deployment ID is fetched exactly once at function entry, immediately after `nomad job run -detach`. If the deployment hasn't yet propagated to Nomad's API (a transient delay), `deployment_id` is empty and the function returns 1 immediately with "no deployment found": a hard failure rather than a retry. This should loop with a short sleep until the deployment ID appears, or at minimum retry a few times before giving up.
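A sketch of the suggested retry, with assumed function names and intervals (not the PR's actual code):

```shell
# Poll for the deployment ID instead of reading it once, giving Nomad's
# API time to propagate the deployment created by `nomad job run -detach`.
get_deployment_id() {
  job_name="$1"
  attempt=0
  max_attempts=12                              # 12 x 5s = up to 60s
  while [ "$attempt" -lt "$max_attempts" ]; do
    deployment_id=$(nomad job deployments -json "$job_name" 2>/dev/null \
      | jq -r '.[0].ID // empty')
    if [ -n "$deployment_id" ]; then
      printf '%s\n' "$deployment_id"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep 5
  done
  echo "no deployment found for ${job_name}" >&2
  return 1
}
```

`nomad job deployments -json` does return an array (newest first), so `.[0]` is correct here, unlike in the status poll.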
Minor Issues

low

`deploy.sh:119`: The alloc-ID extraction from the deployment JSON uses `'.[0].AllocStatus.AllocsNotYetRunning'`. Two problems: (1) the same `.[0]` issue on a single object; (2) `AllocStatus.AllocsNotYetRunning` does not exist in Nomad's deployment JSON schema: the deployment object has task group summaries under `TaskGroups`, not a flat alloc list. This path always returns empty, so diagnostics always fall through to the job-status fallback.

low

`deploy.sh:124,155`: `jq -r '.Evaluations[].Allocations[]?.ID'` is applied to `nomad job status -json` output, but allocations in the job status JSON are at `.Allocations[]` (a top-level array), not nested under evaluations. This also silently yields nothing, meaning the failure/timeout log dump produces no alloc output.
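The path difference is easy to check against a hand-built payload (the shape is assumed for illustration; field values are made up):

```shell
# Job-status-like JSON: Allocations is a top-level array, and the
# Evaluations entries carry no nested alloc list.
json='{"Allocations":[{"ID":"a-1"},{"ID":"a-2"}],"Evaluations":[{"ID":"e-1"}]}'

# Buggy path: each evaluation has no .Allocations, so this yields nothing
# (the trailing `?` suppresses the iterate-over-null error).
printf '%s' "$json" | jq -r '.Evaluations[].Allocations[]?.ID'

# Corrected path: read the top-level array.
printf '%s' "$json" | jq -r '.Allocations[]?.ID'   # prints a-1 and a-2
```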
Non-blocking Notes

- The idempotency check compares `current_status = "running"` (job-level status) while the log says "already healthy". This is intentional and correct: job status "running" is the right idempotency signal; no change needed.
- The per-job timeout override (`JOB_READY_TIMEOUT_<JOBNAME>`) is a clean, useful addition.
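One plausible POSIX-sh shape for the `JOB_READY_TIMEOUT_<JOBNAME>` lookup (the PR's actual mechanism is not shown in this review; the function name, naming convention, and 240s default are all assumptions):

```shell
# Resolve a per-job readiness timeout from JOB_READY_TIMEOUT_<JOBNAME>,
# falling back to a default. Job names are upper-cased and dashes become
# underscores to form a valid variable name (assumed convention).
job_ready_timeout() {
  job="$1"
  default="${2:-240}"
  suffix=$(printf '%s' "$job" | tr 'a-z-' 'A-Z_')
  eval "printf '%s\n' \"\${JOB_READY_TIMEOUT_${suffix}:-${default}}\""
}

JOB_READY_TIMEOUT_API_GATEWAY=300
job_ready_timeout api-gateway    # prints 300 (override found)
job_ready_timeout other-job      # prints 240 (default)
```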
Verdict

REQUEST_CHANGES — `jq '.[0].Status'` on a single-object response never yields a status (jq cannot index an object with a number) — every deployment will time out instead of detecting success/failure
Reviewed at dee05d2 | AGENTS.md
AI Re-review (round 2)
Previous Findings
- `deploy.sh:108`: `jq -r '.[0].Status'` on a single deployment object → no status ever extracted → FIXED: changed to `.Status` at line 111 ✓
- `deploy.sh:85-88`: deployment ID fetched once with no retry → FIXED: retry loop with `max_retries=12`, `sleep 5` per attempt (up to 60s), matching the per-job timeout model ✓
- `deploy.sh:119`: `.[0].AllocStatus.AllocsNotYetRunning` wrong path → FIXED: removed entirely; the failure path now uses `nomad job status -json | jq -r '.Allocations[]?.ID // empty'` at lines 128-129 ✓
- `deploy.sh:155`: `.Evaluations[].Allocations[]?.ID` wrong path → FIXED: changed to `.Allocations[]?.ID` at line 159 ✓

Verification
Checked the full file to confirm:
- `nomad job deployments -json` returns an array → `.[0].ID` is correct for the latest deployment
- `nomad deployment status -json <id>` returns a single object → `.Status` is now correctly applied
- `nomad job status -json <name>` has top-level `.Allocations[]` → both failure and timeout log paths are now correct
- The idempotency check (`current_status = "running"`) at line 203 remains intentional and correct (job-level status, not deployment-level)
- The per-job timeout override `JOB_READY_TIMEOUT_<JOBNAME>` is clean and correct
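Putting the verified pieces together, the corrected polling loop plausibly reduces to something like this (a sketch; the function name, 5-second interval, and exact terminal states are assumptions, not the file's code):

```shell
# Poll `nomad deployment status -json` until a terminal state or timeout.
wait_for_deployment() {
  deployment_id="$1"
  timeout="${2:-240}"
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    status=$(nomad deployment status -json "$deployment_id" 2>/dev/null \
      | jq -r '.Status')
    case "$status" in
      successful)       return 0 ;;  # deployment healthy
      failed|cancelled) return 1 ;;  # terminal failure
      *)                sleep 5; elapsed=$((elapsed + 5)) ;;
    esac
  done
  echo "timed out waiting for deployment ${deployment_id}" >&2
  return 1
}
```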
Verdict

APPROVE — All four previous blockers closed; no new issues introduced
Reviewed at 3734920 | Previous: dee05d2 | AGENTS.md