[nomad-step-5] deploy.sh 240s healthy_deadline too tight for chat cold-start #1036
Reference: disinto-admin/disinto#1036
Repro: `./bin/disinto init --backend=nomad --import-env /tmp/.env --with edge` on fresh LXC.

Symptom: `lib/init/nomad/deploy.sh` times out waiting for chat to become healthy.

Actual state: chat was running 30s later (`nomad alloc status` showed Started → running; task event 2026-04-19T08:18:37Z, healthy observed at ~08:22).

Known pattern (memory project_nomad_migration): chat cold-start ~200s; forgejo ~130s; 240s is too tight. This also causes init to bail before edge + vault-runner, masking their issues.

Fix: bump default `HEALTHY_DEADLINE` in `lib/init/nomad/deploy.sh` from 240 → 360. Alternatively, make it per-job with chat/edge getting 360, others staying at 240.

Side effect: Without this fix, `./bin/disinto init --with edge` appears to succeed (exit 0) even though edge + vault-runner were never deployed. Consider also failing hard on any deploy.sh timeout so exit status reflects incomplete init.

Blocks: Step 6 cutover — init must be idempotent and hard-fail on partial deploy.
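
A rough sketch of the per-job variant combined with the hard-fail behaviour, in case it helps the discussion. Only `HEALTHY_DEADLINE` and the 240/360 values come from this issue; `deadline_for`, `wait_healthy`, `POLL_INTERVAL`, and the probe-command convention are hypothetical names that may not match how `deploy.sh` is actually structured.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the real deploy.sh.
# Only HEALTHY_DEADLINE and the 240/360 values are taken from the issue.

HEALTHY_DEADLINE="${HEALTHY_DEADLINE:-240}"   # default deadline in seconds

# Print the healthy deadline (seconds) for a job name.
deadline_for() {
  case "$1" in
    chat|edge) echo 360 ;;                    # slow cold-start jobs
    *)         echo "$HEALTHY_DEADLINE" ;;    # everything else keeps the default
  esac
}

# Poll a probe command until it succeeds or the job's deadline passes.
# $1 = job name; remaining args = probe command (exit 0 means healthy).
# Returns non-zero on timeout so the caller can abort the whole init.
wait_healthy() {
  local job="$1"; shift
  local deadline elapsed=0 interval="${POLL_INTERVAL:-5}"
  deadline="$(deadline_for "$job")"
  while [ "$elapsed" -lt "$deadline" ]; do
    if "$@"; then
      return 0
    fi
    sleep "$interval"
    elapsed=$(( elapsed + interval ))
  done
  echo "timeout: $job not healthy after ${deadline}s" >&2
  return 1
}
```

A caller would then write something like `wait_healthy chat <probe command> || exit 1`, where the probe is whatever check `deploy.sh` already uses, so that a timeout on any job propagates as a non-zero exit from init instead of a silent partial deploy.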