bug: deploy.sh 360s still too tight for chat cold-start + cascade-skip masks edge/vault-runner #1070

Closed
opened 2026-04-20 07:53:26 +00:00 by dev-bot · 0 comments
Collaborator

Goal: The deploy.sh 360s timeout (raised from 240s per #1036) is still too tight for chat cold-start. A dry-run on 2026-04-20 shows chat needing >360s on a fresh LXC.

Observed

Dry-run on fresh disinto-nomad-box (2026-04-20 07:42 UTC):

[deploy] waiting for job 'chat' to become healthy (timeout: 360s)...
...
[deploy] TIMEOUT: deployment 'c27a436d-267f-96e0-744c-49b98af9c781' did not reach successful state within 360s
[deploy] ERROR: deployment for job 'chat' did not reach successful state

But nomad alloc status <chat-alloc>, run ~60s later, showed State: running. The task started at 2026-04-20T07:43:14Z and was healthy by ~07:49, so the actual cold-start is ≈ 6 minutes.

Impact

Init exits 0 (success) but has skipped submitting edge + vault-runner — the deploy loop bailed after chat timed out. Silent partial-deploy is a repeat of the pattern #1036 tried to fix. Operator has to manually nomad job run the remaining jobs.

Fix options

(a) bump JOB_READY_TIMEOUT_<CHAT> to 600s in lib/init/nomad/deploy.sh: targeted fix for the slow task without slowing everything else
(b) continue on deploy timeout — don't let chat timeout block edge + vault-runner submission; log warning, keep going; final summary reports any unhealthy jobs
(c) both — bump chat's timeout AND make deploy.sh tolerant of timeouts (don't cascade-skip subsequent jobs)

Recommend (c): chat's real cold start is around 5–6 min even with the Nomad+LLM path; give it 600s. Separately, partial-deploy masking is the real bug — timeouts should not silently skip later jobs.
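The continue-on-timeout behaviour could be sketched roughly as follows. This is a minimal illustration, not the actual deploy.sh code: `submit_and_wait`, `deploy_all`, and the `deferred` array are hypothetical names, and `submit_and_wait` is stubbed so the example runs without a Nomad cluster. Returning non-zero when any job is deferred is also an assumption about how the partial deploy should be surfaced.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the real "submit job + wait for healthy" step.
# Stubbed here: 'chat' times out, every other job becomes healthy.
submit_and_wait() {
  [ "$1" != "chat" ]
}

deploy_all() {
  local job d status
  deferred=()   # jobs that did not become healthy within their timeout

  for job in "$@"; do
    if submit_and_wait "$job"; then
      echo "[deploy] job '$job' healthy"
    else
      # Log and remember the failure, but keep submitting later jobs.
      echo "[deploy] WARNING: job '$job' not healthy within timeout; continuing" >&2
      deferred+=("$job")
    fi
  done

  # Final health table: one row per job.
  for job in "$@"; do
    status="healthy"
    for d in "${deferred[@]}"; do
      [ "$d" = "$job" ] && status="UNHEALTHY"
    done
    printf '%-14s %s\n' "$job" "$status"
  done

  # Assumption: report failure if anything was deferred, so a partial
  # deploy is not mistaken for success.
  [ "${#deferred[@]}" -eq 0 ]
}

deploy_all chat edge vault-runner \
  || echo "[deploy] ${#deferred[@]} job(s) need attention: ${deferred[*]}"
```

With the stub above, edge and vault-runner are still submitted after chat times out, and chat is the only job listed as needing attention.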

Scope hint

  • lib/init/nomad/deploy.sh: raise JOB_READY_TIMEOUT_CHAT (or generic default) from 360 → 600
  • Same file: when a deploy wait times out, log a WARNING, record the job in a deferred list, and continue submitting subsequent jobs; at the end, print a final health table

Related

  • Previous bump in #1036 from 240s → 360s
  • Discovered during #1037 dry-run; affects smooth cutover rehearsal
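Since this would be the second bump of the same value, a per-job override may avoid raising the wait for every job. A minimal sketch, assuming a `JOB_READY_TIMEOUT_<JOB>` variable overrides a generic default: the helper `job_ready_timeout` and the default's name `JOB_READY_TIMEOUT_SECS` are hypothetical and not taken from deploy.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper: resolve the readiness timeout for one job.
# JOB_READY_TIMEOUT_<JOB> wins if set; otherwise fall back to the
# generic default (name assumed here), and finally to 360.
job_ready_timeout() {
  local job="$1" suffix var
  # Map the job name to an env-var suffix: vault-runner -> VAULT_RUNNER
  suffix="$(printf '%s' "$job" | tr 'a-z-' 'A-Z_')"
  var="JOB_READY_TIMEOUT_${suffix}"
  # Bash indirect expansion: read the variable whose name is in $var.
  printf '%s\n' "${!var:-${JOB_READY_TIMEOUT_SECS:-360}}"
}

JOB_READY_TIMEOUT_CHAT=600
job_ready_timeout chat          # prints 600
job_ready_timeout vault-runner  # prints 360 (no override set)
```

Only chat waits longer; jobs without an override keep the existing 360s.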
dev-bot added the backlog and bug-report labels 2026-04-20 07:53:26 +00:00
dev-bot self-assigned this 2026-04-20 07:55:26 +00:00
dev-bot added the in-progress label and removed the backlog label 2026-04-20 07:55:26 +00:00
dev-bot removed their assignment 2026-04-20 08:04:28 +00:00
Reference: disinto-admin/disinto#1070