[nomad-step-0] S0.2-fix — install.sh must also install docker daemon (block step 1 placement) #871

Closed

opened 2026-04-16 13:11:39 +00:00 by dev-bot · 0 comments

Bugfix for S0.2 (#822) / Step-0 install. Discovered during Step 1 end-to-end verification on a fresh LXC.

Symptom

On a freshly-launched ubuntu:24.04 LXC + disinto init --backend=nomad --with forgejo:

[deploy] waiting for job 'forgejo' to become running (timeout: 120s)...
[deploy] TIMEOUT: job 'forgejo' did not reach running state within 120s

nomad job status forgejo shows:

Placement Failure
Task Group "forgejo":
  * Constraint "missing drivers": 1 nodes excluded by filter

nomad node status -self -verbose shows:

Drivers
Driver    Detected  Healthy  Message
docker    false     false    Failed to connect to docker daemon

which docker → not found. systemctl is-active docker → inactive.

Root cause

lib/init/nomad/install.sh (from S0.2 #822) installs nomad and vault from the HashiCorp apt repo but does not install docker. On disinto-dev-box docker is pre-installed as part of the existing factory setup, so Step 0 verification passed silently — the cluster came up healthy and we never tried to place a docker-driver job.

Step 1's forgejo.hcl is the first job that actually needs the docker driver. The constraint filter rejects the node because the driver is unhealthy, and deploy.sh times out after 120s with no placement.

Fix

Extend lib/init/nomad/install.sh to also install docker when missing:

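if ! command -v docker >/dev/null 2>&1; then
  # The package installed here is Ubuntu's docker.io, not upstream docker-ce,
  # so the log line names the package that is actually installed.
  echo "[install] installing docker.io"
  # Ubuntu-native: `apt-get install docker.io` is sufficient for factory dev box
  # (matches the existing disinto-dev-box setup). The upstream docker-ce repo
  # is an option but adds a second apt source with pinning, so keep it simple.
  apt-get install -y -q docker.io
  # Enable at boot and start the daemon now so the nomad client can detect it.
  systemctl enable --now docker
fi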

Then in cluster-up.sh step 8 (start nomad), add a short poll for the docker driver to report healthy before polling for node ready. Otherwise the race between the docker daemon starting and the nomad client's driver health check can show up later as the same "missing drivers" placement failure, even though docker is installed.
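A minimal sketch of such a poll follows. The names `docker_driver_healthy` and `wait_for_docker_driver`, the `[cluster-up]` log prefix, and the default timeout and interval are assumptions for illustration and are not taken from cluster-up.sh. It also assumes the Drivers table layout shown in the Symptom section (Driver, Detected, Healthy, Message).

```bash
#!/usr/bin/env bash
# Sketch only: function names and defaults below are hypothetical.

# Reads `nomad node status -self -verbose` output on stdin and succeeds only if
# the docker row of the Drivers table shows Detected=true and Healthy=true.
docker_driver_healthy() {
  awk '$1 == "docker" && $2 == "true" && $3 == "true" { found = 1 }
       END { exit (found ? 0 : 1) }'
}

# Polls the local node until the docker driver is healthy or the timeout expires.
# $1 = timeout in seconds (default 60), $2 = poll interval in seconds (default 2).
wait_for_docker_driver() {
  local timeout="${1:-60}" interval="${2:-2}" elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    if nomad node status -self -verbose 2>/dev/null | docker_driver_healthy; then
      echo "[cluster-up] docker driver healthy after ${elapsed}s"
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "[cluster-up] TIMEOUT: docker driver not healthy within ${timeout}s" >&2
  return 1
}
```

Calling `wait_for_docker_driver 60 2` before the existing node-ready poll would make a slow docker start fail with an explicit message instead of a later placement timeout.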

Acceptance criteria

On a fresh ubuntu:24.04 LXC + clone:

- ./bin/disinto init --backend=nomad --empty → cluster up, nomad node status -self -verbose | grep docker shows Detected=true Healthy=true.
- ./bin/disinto init --backend=nomad --with forgejo → forgejo job places, becomes running, curl http://localhost:3000/api/v1/version returns 200.
- Re-running is a no-op (docker install already present on second run).
- shellcheck clean.
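The "returns 200" criterion can be checked mechanically. The sketch below is one possible way to do it; `wait_for_http_200`, the `[verify]` log prefix, and the default timeout and interval are hypothetical names and values, not part of the existing scripts. Only the URL comes from the criteria above.

```bash
#!/usr/bin/env bash
# Sketch only: helper name and defaults are hypothetical.

# Polls a URL until it answers with HTTP 200 or the timeout expires.
# $1 = URL, $2 = timeout in seconds (default 60), $3 = interval (default 2).
wait_for_http_200() {
  local url="$1" timeout="${2:-60}" interval="${3:-2}" elapsed=0 code=""
  while [ "$elapsed" -lt "$timeout" ]; do
    # -s hides progress, -o discards the body, -w prints only the status code.
    # curl prints 000 and exits non-zero when it cannot connect, hence `|| true`.
    code="$(curl -s -o /dev/null -w '%{http_code}' "$url" || true)"
    if [ "$code" = "200" ]; then
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "[verify] ${url} did not return 200 within ${timeout}s (last status: ${code:-none})" >&2
  return 1
}
```

For example, `wait_for_http_200 http://localhost:3000/api/v1/version 120 2` could run right after the forgejo job reports running, since the HTTP listener may lag the task state by a few seconds.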

Why Step 0 verification missed it

Step 0's test was "cluster healthy + idempotent re-run." Both passed because Nomad + Vault were up and the docker driver reporting Detected=false Healthy=false isn't a cluster-up failure — it's a driver-availability signal that only blocks at job-placement time.

Going forward: either add a driver-health assertion to Step 0 verification, or accept that Step 1's "deploy forgejo and hit :3000" IS the real Step 0 completeness test. Leaving as is (Step 1 = real integration test for Step 0 drivers) is acceptable — just document it.
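If the first option is chosen, the driver-health assertion could look roughly like the sketch below. `assert_driver_healthy` and the `[verify]` prefix are hypothetical names, and the parsing assumes the Drivers table layout shown in the Symptom section. It is a one-shot check, not a poll.

```bash
#!/usr/bin/env bash
# Sketch only: helper name and message format are hypothetical.

# Succeeds if the named task driver is listed with Detected=true and
# Healthy=true on the local node; otherwise prints the reason to stderr.
# $1 = driver name, e.g. docker.
assert_driver_healthy() {
  local driver="$1"
  nomad node status -self -verbose 2>/dev/null | awk -v d="$driver" '
    $1 == d {
      seen = 1
      if ($2 == "true" && $3 == "true") ok = 1
      else row = $0            # keep the unhealthy row, including its Message
    }
    END {
      if (ok) exit 0
      if (seen) print "[verify] FAIL: driver unhealthy: " row
      else      print "[verify] FAIL: driver " d " not listed in node status"
      exit 1
    }' >&2
}
```

Running `assert_driver_healthy docker` at the end of Step 0 verification would have failed on the fresh LXC with the "Failed to connect to docker daemon" row, without needing a docker-driver job to be placed.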

Labels / meta

backlog + bug-report.
