bug: deploy.sh 360s still too tight for chat cold-start + cascade-skip masks edge/vault-runner #1070
Reference: disinto-admin/disinto#1070
Goal: the deploy.sh 360s timeout (raised from 240s per #1036) is still too tight for chat cold-start. A dry-run on 2026-04-20 shows chat needing >360s on a fresh LXC.
Observed
Dry-run on fresh disinto-nomad-box (2026-04-20 07:42 UTC):
But `nomad alloc status <chat-alloc>` ~60s later: State: running, task started 2026-04-20T07:43:14Z, healthy by ~07:49. Actual cold-start ≈ 6 minutes.
Impact
Init exits 0 (success) but has skipped submitting edge + vault-runner: the deploy loop bailed after chat timed out. Silent partial-deploy is a repeat of the pattern #1036 tried to fix. The operator has to manually `nomad job run` the remaining jobs.
Fix options
(a) bump JOB_READY_TIMEOUT_ to 600s in `lib/init/nomad/deploy.sh`: targeted fix for the slow task without slowing everything else
(b) continue on deploy timeout: don't let a chat timeout block edge + vault-runner submission; log a warning, keep going; final summary reports any unhealthy jobs
(c) both — bump chat's timeout AND make deploy.sh tolerant of timeouts (don't cascade-skip subsequent jobs)
Recommend (c): chat's real cold start is around 5–6 min even with the Nomad+LLM path; give it 600s. Separately, partial-deploy masking is the real bug — timeouts should not silently skip later jobs.
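A minimal sketch of what option (c) could look like, assuming a per-job override variable named `JOB_READY_TIMEOUT_<JOB>` with a generic default. The helpers `submit_job` and `wait_healthy` are hypothetical placeholders for whatever deploy.sh actually uses to submit a job and poll its health; `job_timeout`, `deploy_all`, `JOB_READY_TIMEOUT_DEFAULT` and `UNHEALTHY_JOBS` are likewise illustrative names, not taken from the real script.

```shell
# Sketch only: not the actual lib/init/nomad/deploy.sh.
# Assumed defaults: generic 360s, chat raised to 600s.
JOB_READY_TIMEOUT_DEFAULT="${JOB_READY_TIMEOUT_DEFAULT:-360}"
JOB_READY_TIMEOUT_CHAT="${JOB_READY_TIMEOUT_CHAT:-600}"

# Print the timeout for a job: JOB_READY_TIMEOUT_<JOB> if set, else the default.
# Assumes job names contain only lowercase letters, digits and hyphens.
job_timeout() {
  jt_var="JOB_READY_TIMEOUT_$(printf '%s' "$1" | tr 'a-z-' 'A-Z_')"
  eval "jt_val=\${$jt_var:-}"
  printf '%s\n' "${jt_val:-$JOB_READY_TIMEOUT_DEFAULT}"
}

# Submit every job in order. A failed submit or a health timeout is logged
# and recorded, but never prevents later jobs from being submitted.
# submit_job and wait_healthy are hypothetical hooks supplied by the caller.
deploy_all() {
  UNHEALTHY_JOBS=""
  for da_job in "$@"; do
    da_timeout=$(job_timeout "$da_job")
    if ! submit_job "$da_job"; then
      echo "WARN: submit failed for $da_job, continuing" >&2
      UNHEALTHY_JOBS="$UNHEALTHY_JOBS $da_job"
      continue
    fi
    if ! wait_healthy "$da_job" "$da_timeout"; then
      echo "WARN: $da_job not healthy within ${da_timeout}s, continuing" >&2
      UNHEALTHY_JOBS="$UNHEALTHY_JOBS $da_job"
    fi
  done
  # Final summary: a non-zero exit makes a partial deploy visible to the caller.
  if [ -n "$UNHEALTHY_JOBS" ]; then
    echo "deploy summary: unhealthy jobs:$UNHEALTHY_JOBS" >&2
    return 1
  fi
  echo "deploy summary: all jobs healthy"
}
```

With this shape, a chat timeout would still leave edge and vault-runner submitted, and init would exit non-zero instead of reporting success on a partial deploy. Whether a non-zero exit or a warning-only summary is the right behaviour is a design choice the issue leaves open.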
Scope hint
`lib/init/nomad/deploy.sh`: raise `JOB_READY_TIMEOUT_CHAT` (or generic default) from 360 → 600
Related