[nomad-step-1] deploy.sh-fix — poll deployment status not alloc status; bump timeout 120→240s #878
Reference: disinto-admin/disinto#878
Bugfix for S1.2 (#841). Discovered during Step 1 end-to-end verification: Forgejo deployed successfully, but `deploy.sh` exited with a misleading "TIMEOUT" message.

## Symptom
On a fresh LXC running `disinto init --backend=nomad --with forgejo`:

Exit 1. But the job actually succeeded:

And `curl http://10.10.10.70:3000/api/v1/version` returned `{"version":"11.0.12+gitea-1.22.0"}`.

## Root cause
`deploy.sh`'s `_wait_job_running` polls allocation status and returns on `Status == "running"`. For Nomad service jobs with a `check` stanza, the allocation transitions to `running` fast, but the deployment only becomes `successful` after the `check` reports `passing` for the `min_healthy_time` duration. Forgejo's first-run migration takes ~90–120s before the healthcheck passes; 120s is a coin-flip for the current Forgejo image on a cold LXC.

Two separable bugs:

- `nomad deployment status -json <deployment-id>` → wait for `Status == "successful"`.

## Fix
In `deploy.sh`, replace alloc-status polling in `_wait_job_running` with:

- Get the deployment ID: `nomad job deployments -json <job> | jq '.[0].ID'`.
- Poll `nomad deployment status -json <deployment-id>` for `.Status == "successful"` or `"failed"`.
- `failed` → exit 1 and dump alloc stderr as today.
- `successful` → exit 0, print `[deploy] <job> healthy after Ns`.

Bump `JOB_READY_TIMEOUT_SECS` from 120 → 240.

Support a `JOB_READY_TIMEOUT_<JOBNAME>` env var (e.g. `JOB_READY_TIMEOUT_FORGEJO=300`) so slow-starting services get more headroom without stretching fast ones.

## Acceptance criteria
- `disinto init --backend=nomad --with forgejo` exits 0 with `[deploy] forgejo healthy after Ns` and no "TIMEOUT" line.
- `deploy.sh` exits 1 within 60s with deployment-failed error + stderr dump.
- When `nomad job run` returns no-change, `deploy.sh` prints `[deploy] forgejo already healthy`.
- `shellcheck` clean.

## Why Step 1 "passed" despite this
The job was running and reachable, so the bug was cosmetic plus a non-zero exit. But the non-zero exit is a real problem: `disinto init` exits 1, downstream automation (cutover scripts) interprets that as failure, and the operator has to verify manually. It gets worse in Step 3+, where multiple services chain deploy calls.

## Labels / meta
`backlog` + `bug-report`. Not a migration blocker, but should land before Step 3 (woodpecker's healthcheck window is longer than forgejo's).
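
For reference, the polling change and per-job timeout described under Fix could look roughly like the sketch below. It is not the actual `deploy.sh` code. The function names `_job_ready_timeout` and `_wait_deployment_successful`, the `POLL_INTERVAL_SECS` variable, the rule for mapping a job name to an env-var suffix, and every log line other than `[deploy] <job> healthy after Ns` are assumptions made for illustration. The two `nomad` invocations and the `jq` filters are the ones named in this issue.

```shell
#!/usr/bin/env bash
# Sketch only: names and log wording here are hypothetical, except the
# "[deploy] <job> healthy after Ns" line taken from the issue text.

# Resolve the timeout for a job: JOB_READY_TIMEOUT_<JOBNAME> if set,
# else JOB_READY_TIMEOUT_SECS, else 240.
_job_ready_timeout() {
  local job="$1" suffix var
  # Assumption: the suffix is the job name upper-cased, with every
  # character outside [A-Z0-9_] replaced by "_".
  suffix=$(printf '%s' "$job" | tr '[:lower:]' '[:upper:]' | tr -c 'A-Z0-9_' '_')
  var="JOB_READY_TIMEOUT_${suffix}"
  printf '%s\n' "${!var:-${JOB_READY_TIMEOUT_SECS:-240}}"
}

# Wait for the job's newest deployment to reach "successful" or "failed".
_wait_deployment_successful() {
  local job="$1" timeout interval="${POLL_INTERVAL_SECS:-5}"
  local start=$SECONDS dep_id status=""
  timeout=$(_job_ready_timeout "$job")

  # Deployment ID lookup, using the jq filter from the issue.
  dep_id=$(nomad job deployments -json "$job" | jq -r '.[0].ID')
  if [ -z "$dep_id" ] || [ "$dep_id" = "null" ]; then
    echo "[deploy] $job: no deployment found" >&2
    return 1
  fi

  while (( SECONDS - start < timeout )); do
    status=$(nomad deployment status -json "$dep_id" | jq -r '.Status')
    case "$status" in
      successful)
        echo "[deploy] $job healthy after $(( SECONDS - start ))s"
        return 0 ;;
      failed)
        # The existing alloc stderr dump would go here.
        echo "[deploy] $job deployment failed" >&2
        return 1 ;;
    esac
    sleep "$interval"
  done

  echo "[deploy] $job TIMEOUT after ${timeout}s (last status: ${status:-unknown})" >&2
  return 1
}
```

With `JOB_READY_TIMEOUT_FORGEJO=300` set, `_wait_deployment_successful forgejo` would wait up to 300s, while jobs without their own variable keep the shared default. Treating any status other than `successful` or `failed` as "keep waiting" is a simplification; the real script may want to handle other terminal deployment states explicitly.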