[nomad-step-0] S0.2-fix — install.sh must also install docker daemon (block step 1 placement) #871
Reference: disinto-admin/disinto#871
Bugfix for S0.2 (#822) / Step-0 install. Discovered during Step 1 end-to-end verification on a fresh LXC.
Symptom
On a freshly launched ubuntu:24.04 LXC, after disinto init --backend=nomad --with forgejo:
- nomad job status forgejo shows the job with no placement.
- nomad node status -self -verbose shows the docker driver as Detected=false Healthy=false.
- which docker → not found.
- systemctl is-active docker → inactive.
Root cause
lib/init/nomad/install.sh (from S0.2 #822) installs nomad and vault from the HashiCorp apt repo but does not install docker. On disinto-dev-box, docker is pre-installed as part of the existing factory setup, so Step 0 verification passed silently: the cluster came up healthy and we never tried to place a docker-driver job.
Step 1's forgejo.hcl is the first job that actually needs the docker driver. The constraint filter rejects the node because the driver is unhealthy, and deploy.sh times out after 120s with no placement.
Fix
Extend lib/init/nomad/install.sh to also install docker when missing.
Then in cluster-up.sh step 8 (start nomad), add a short poll for the docker driver to report healthy before polling for node ready; otherwise the race between docker starting and the nomad client health check can surface confusingly.
Acceptance criteria
On a fresh ubuntu:24.04 LXC + clone:
- ./bin/disinto init --backend=nomad --empty → cluster up, nomad node status -self -verbose | grep docker shows Detected=true Healthy=true.
- ./bin/disinto init --backend=nomad --with forgejo → forgejo job places, becomes running, curl http://localhost:3000/api/v1/version returns 200.
- shellcheck clean.
Why Step 0 verification missed it
Step 0's test was "cluster healthy + idempotent re-run." Both passed because Nomad + Vault were up and the docker driver reporting Detected=false Healthy=false isn't a cluster-up failure; it's a driver-availability signal that only blocks at job-placement time.
Going forward: either add a driver-health assertion to Step 0 verification, or accept that Step 1's "deploy forgejo and hit :3000" IS the real Step 0 completeness test. Leaving as is (Step 1 = real integration test for Step 0 drivers) is acceptable; just document it.
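A minimal sketch of the install-when-missing step and the driver-health check discussed above; the same check could serve as the poll in cluster-up.sh step 8 and as a Step 0 assertion. Everything here is an assumption rather than code from the repository: the function names, the 60s timeout and 2s interval, the choice of Ubuntu's docker.io package, and the table layout of nomad node status -self -verbose (driver name, Detected, Healthy as the first three columns).

```shell
#!/usr/bin/env bash
# Sketch only: names, timings, and package choice are assumptions,
# not taken from the disinto repository.

# Install docker only when it is not already on PATH (needs root).
# Assumes Ubuntu's docker.io package is an acceptable source.
ensure_docker() {
  if command -v docker >/dev/null 2>&1; then
    return 0
  fi
  apt-get update -qq
  apt-get install -y -q docker.io
  systemctl enable --now docker
}

# Succeed when the first row whose first column is "docker" has
# Detected=true and Healthy=true in columns 2 and 3.
# Assumes a "Driver Detected Healthy ..." table in the verbose output.
docker_driver_healthy() {
  nomad node status -self -verbose 2>/dev/null |
    awk '$1 == "docker" && !seen { seen = 1; ok = ($2 == "true" && $3 == "true") }
         END { exit ok ? 0 : 1 }'
}

# Poll until the docker driver is healthy or the timeout expires.
# $1 = timeout in seconds (default 60), $2 = poll interval (default 2).
wait_for_docker_driver() {
  local timeout="${1:-60}"
  local interval="${2:-2}"
  local waited=0
  until docker_driver_healthy; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "docker driver not healthy after ${timeout}s" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}
```

In cluster-up.sh this might be called as wait_for_docker_driver 60 2 right after nomad starts and before the node-ready poll, so a slow docker start fails with a clear message instead of a placement timeout.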
Labels / meta
backlog + bug-report.