[nomad-step-5] deploy.sh 240s healthy_deadline too tight for chat cold-start #1036

Closed

opened 2026-04-19 08:42:02 +00:00 by dev-bot · 0 comments

dev-bot commented

2026-04-19 08:42:02 +00:00

Collaborator

Repro: ./bin/disinto init --backend=nomad --import-env /tmp/.env --with edge on fresh LXC.

Symptom: lib/init/nomad/deploy.sh times out waiting for chat to become healthy:

[deploy] waiting for job 'chat' to become healthy (timeout: 240s)...
...
[deploy] TIMEOUT: deployment 'aface5ba-...' did not reach successful state within 240s
[deploy] ERROR: deployment for job 'chat' did not reach successful state

Actual state: chat was running 30s later (nomad alloc status showed Started → running; task event 2026-04-19T08:18:37Z, healthy observed at ~08:22).

Known pattern (memory project_nomad_migration): chat cold-start ~200s; forgejo ~130s; 240s is too tight. This also causes init to bail before edge + vault-runner, masking their issues.

Fix: bump default HEALTHY_DEADLINE in lib/init/nomad/deploy.sh from 240 → 360. Alternatively, make it per-job with chat/edge getting 360, others staying at 240.
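The per-job alternative could look roughly like the sketch below. The helper name `healthy_deadline_for` is hypothetical and not taken from `lib/init/nomad/deploy.sh`. The sketch also assumes that an explicitly exported `HEALTHY_DEADLINE` should still override every job, which may not match how the script reads that variable today.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not present in deploy.sh): choose a health deadline per job.
# Assumption: an explicitly set HEALTHY_DEADLINE overrides every job.
healthy_deadline_for() {
  local job="$1"
  if [ -n "${HEALTHY_DEADLINE:-}" ]; then
    echo "$HEALTHY_DEADLINE"
    return 0
  fi
  case "$job" in
    chat|edge) echo 360 ;;  # slow cold-starters get the longer deadline
    *)         echo 240 ;;  # all other jobs keep the current default
  esac
}
```

The wait loop would then call something like `timeout="$(healthy_deadline_for "$job")"` instead of reading one global value.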

Side effect: Without this fix, ./bin/disinto init --with edge appears to succeed (exit 0) even though edge + vault-runner were never deployed. Consider also failing hard on any deploy.sh timeout so exit status reflects incomplete init.
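A minimal sketch of the hard-fail behaviour, assuming init deploys jobs one at a time through a single command. `deploy_all`, its arguments, and the `[init]` log prefix are hypothetical names used for illustration. The real init flow may be structured differently.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: deploy jobs in order and stop at the first failure,
# so the caller's exit status reflects an incomplete init.
# $1 is the deploy command (or function); the remaining arguments are job names.
deploy_all() {
  local deploy_cmd="$1"
  shift
  local job
  for job in "$@"; do
    if ! "$deploy_cmd" "$job"; then
      echo "[init] ERROR: deploy of '$job' failed; aborting" >&2
      return 1
    fi
  done
  return 0
}
```

Called as `deploy_all lib/init/nomad/deploy.sh forgejo chat edge || exit 1`, for example, a timeout on chat would stop init with a non-zero exit instead of reporting success. The job list in that call is illustrative only.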

Blocks: Step 6 cutover — init must be idempotent and hard-fail on partial deploy.
