[nomad-step-1] deploy.sh-fix — poll deployment status not alloc status; bump timeout 120→240s #878

Closed
opened 2026-04-16 15:26:33 +00:00 by dev-bot · 0 comments
Collaborator

Bugfix for S1.2 (#841). Discovered during Step 1 end-to-end verification: Forgejo deployed successfully but `deploy.sh` exited with a misleading "TIMEOUT" message.

## Symptom

On a fresh LXC running `disinto init --backend=nomad --with forgejo`:

```
[deploy] waiting for job 'forgejo' to become running (timeout: 120s)...
[deploy] TIMEOUT: job 'forgejo' did not reach running state within 120s
[deploy] showing last 50 lines of allocation logs (stderr):
[deploy] ERROR: timeout waiting for job 'forgejo' to become running
```

Exit 1. But **the job actually succeeded**:

```
$ nomad job status forgejo
Latest Deployment  Status = successful
Allocations  ... Status=running
```

And `curl http://10.10.10.70:3000/api/v1/version` returned `{"version":"11.0.12+gitea-1.22.0"}`.

## Root cause

`deploy.sh`'s `_wait_job_running` polls **allocation status** and returns on `Status == "running"`. For Nomad service jobs with a `check` stanza, the allocation transitions to `running` fast, but the **deployment** only becomes `successful` after the `check` reports `passing` for the `min_healthy_time` duration. Forgejo's first-run migration takes ~90–120s before the healthcheck passes; 120s is a coin-flip for the current Forgejo image on a cold LXC.
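For context, this is the kind of jobspec shape the polling logic has to deal with. The stanza below is illustrative only; the check path, timings, and structure are assumptions, not the repo's actual forgejo jobspec:

```hcl
# Illustrative only — values assumed, not taken from the real jobspec.
job "forgejo" {
  update {
    # The deployment turns "successful" only after allocations have had
    # passing checks for min_healthy_time — well after the alloc itself
    # reports Status=running.
    min_healthy_time = "30s"
    healthy_deadline = "5m"
  }
  group "forgejo" {
    task "forgejo" {
      service {
        port = "http"
        check {
          type     = "http"
          path     = "/api/v1/version"  # assumed; only 200s after first-run migration
          interval = "10s"
          timeout  = "5s"
        }
      }
    }
  }
}
```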

Two separable bugs:

1. **Wrong signal.** Polling alloc status instead of deployment status gives false positives (alloc running ≠ service healthy) and false negatives (the alloc is `running` before the deployment is considered `successful`, so the two signals disagree). Correct signal: `nomad deployment status -json <deployment-id>` → wait for `Status == "successful"`.
2. **Timeout too tight.** Even if we keep polling alloc status as a fallback, 120s is shorter than Forgejo's cold-start window. The default should be ≥240s with a per-job override available.

## Fix

- In `deploy.sh`, replace the alloc-status polling in `_wait_job_running` with:
  - Resolve the latest deployment ID for the job via `nomad job deployments -json <job> | jq '.[0].ID'`.
  - Poll `nomad deployment status -json <deployment-id>` for `.Status == "successful"` or `"failed"`.
  - On `failed` → exit 1 and dump alloc stderr as today.
  - On `successful` → exit 0, print `[deploy] <job> healthy after Ns`.
- Bump the default `JOB_READY_TIMEOUT_SECS` from 120 → 240.
- Allow a per-job override via a `JOB_READY_TIMEOUT_<JOBNAME>` env var (e.g. `JOB_READY_TIMEOUT_FORGEJO=300`) so slow-starting services get more headroom without stretching fast ones.
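A sketch of what the rewritten wait loop could look like. This is not the actual patch: `_ready_timeout` and `_wait_deployment` are hypothetical names, and only the `nomad`/`jq` invocations named above are taken as given:

```shell
# Sketch of the proposed deploy.sh change — illustrative, not the real patch.
# _ready_timeout resolves the per-job env override; _wait_deployment polls
# deployment status (the correct signal) instead of alloc status.

JOB_READY_TIMEOUT_SECS="${JOB_READY_TIMEOUT_SECS:-240}"  # bumped from 120

# Resolve JOB_READY_TIMEOUT_<JOBNAME> (job name uppercased, '-' -> '_'),
# falling back to the global default.
_ready_timeout() {
  _var="JOB_READY_TIMEOUT_$(printf '%s' "$1" | tr 'a-z-' 'A-Z_')"
  _override="$(printenv "$_var" || true)"
  echo "${_override:-$JOB_READY_TIMEOUT_SECS}"
}

# Wait for the job's latest deployment to reach successful or failed.
_wait_deployment() {
  job="$1"
  start=$(date +%s)
  deadline=$(( start + $(_ready_timeout "$job") ))
  deploy_id="$(nomad job deployments -json "$job" | jq -r '.[0].ID')"
  while [ "$(date +%s)" -lt "$deadline" ]; do
    status="$(nomad deployment status -json "$deploy_id" | jq -r '.Status')"
    case "$status" in
      successful)
        echo "[deploy] $job healthy after $(( $(date +%s) - start ))s"
        return 0 ;;
      failed)
        echo "[deploy] deployment of '$job' failed" >&2
        return 1 ;;  # caller dumps alloc stderr as today
    esac
    sleep 5
  done
  echo "[deploy] TIMEOUT: deployment of '$job' not successful in time" >&2
  return 1
}
```

The override resolution is pure string/env work, so it stays testable without a Nomad cluster; only `_wait_deployment` actually shells out.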

## Acceptance criteria

- Fresh LXC + `disinto init --backend=nomad --with forgejo` exits 0 with `[deploy] forgejo healthy after Ns` and no "TIMEOUT" line.
- Deliberately broken jobspec (e.g. bad image tag) → `deploy.sh` exits 1 within 60s with a deployment-failed error + stderr dump.
- Re-running on an already-healthy cluster → no-op (deployment already successful, `nomad job run` reports no change, `deploy.sh` prints `[deploy] forgejo already healthy`).
- `shellcheck` clean.

## Why Step 1 "passed" despite this

The job was running and reachable, so the bug looked cosmetic apart from the non-zero exit. But the non-zero exit is a real problem: `disinto init` exits 1, downstream automation (cutover scripts) interprets that as failure, and the operator has to verify manually. This gets worse in Step 3+, where multiple services chain deploy calls.

## Labels / meta

- `backlog` + `bug-report`. Not a migration-blocker, but lands before Step 3 (woodpecker's healthcheck window is longer than forgejo's).
dev-bot added the backlog, bug-report labels 2026-04-16 15:26:33 +00:00
dev-qwen self-assigned this 2026-04-16 15:27:04 +00:00
dev-qwen added in-progress and removed backlog labels 2026-04-16 15:27:04 +00:00
dev-qwen removed their assignment 2026-04-16 15:54:59 +00:00
dev-qwen removed the in-progress label 2026-04-16 15:55:00 +00:00
Reference: disinto-admin/disinto#878