fix: [nomad-step-0] S0.2-fix — install.sh must also install docker daemon (block step 1 placement) (#871) #876

Merged
dev-qwen merged 1 commit from fix/issue-871 into main 2026-04-16 14:19:45 +00:00
Collaborator

Fixes #871

Changes

Fixes #871 ## Changes
dev-bot added 1 commit 2026-04-16 14:05:42 +00:00
fix: [nomad-step-0] S0.2-fix — install.sh must also install docker daemon (block step 1 placement) (#871)
All checks were successful
ci/woodpecker/push/ci Pipeline was successful
ci/woodpecker/push/nomad-validate Pipeline was successful
ci/woodpecker/pr/ci Pipeline was successful
ci/woodpecker/pr/nomad-validate Pipeline was successful
ci/woodpecker/pr/smoke-init Pipeline was successful
b77bae9c2a
Nomad's docker task driver reports Healthy=false without a running
dockerd. On the factory dev box docker was pre-installed so Step 0's
cluster-up passed silently, but a fresh ubuntu:24.04 LXC hit "missing
drivers" placement failures the moment Step 1 tried to deploy forgejo
(the first docker-driver consumer).

Fix install.sh to also install docker.io + enable --now docker.service
when absent, and add a poll for the nomad self-node's docker driver
Detected+Healthy before declaring Step 8 done — otherwise the race
between dockerd startup and nomad driver fingerprinting lets the node
reach "ready" while docker is still unhealthy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collaborator

AI Review

Summary

Root cause / fix match: Issue #871 describes a "missing drivers" placement failure on fresh LXC boxes because docker is not pre-installed and Nomad's docker fingerprinter reports Healthy=false. The PR fixes both legs: install.sh installs docker.io and starts docker.service, and cluster-up.sh adds a third poll gate (nomad_docker_driver_healthy) to Step 8 so the orchestrator waits until Nomad itself confirms the driver is live. Diagnosis-to-fix match is tight.

install.sh

  • Idempotency: sentinel is command -v docker; systemctl enable --now is a documented no-op on an already-enabled+active unit. Both are correct for the binary-presence model.
  • Deliberate start-on-install: the "install-but-don't-start" convention applies to disinto-owned units (nomad, vault). Docker is a third-party dependency that must be running before Nomad fingerprints it. The comment in the header explains this clearly and explicitly — not an oversight.
  • Conditional apt-get update: when nomad+vault are already at pin (need_pkgs empty) but docker is absent, the HashiCorp block (which runs apt-get update) is skipped and apt-get install docker.io runs against whatever cache age is present. On a genuinely stale cache this would fail — but loudly with a clear apt error, not silently. Given that docker.io is in the base Ubuntu sources and this scenario (nomad+vault present, docker absent) is already an unusual state, the tradeoff is acceptable.
  • --no-install-recommends absent for docker.io: minor style inconsistency with the nomad/vault install line; no correctness impact on a dev box.

cluster-up.sh

  • nomad_docker_driver_healthy: JSON path .Drivers.docker.Detected / .Drivers.docker.Healthy matches the standard Nomad node-status schema. // false fallback handles the "no docker key at all" case correctly. || true on the nomad node status call is the right pattern (mirrors nomad_ready_count).
  • Step 8 three-condition fast-path: all three conditions must be true to skip; any false triggers the else branch, which conditionally starts nomad only if inactive, then polls both health conditions in sequence. Logic is correct and mirrors the vault step pattern.
  • Error attribution: poll_until_healthy nomad nomad_docker_driver_healthy names the service as "nomad" in the timeout message, but the real culprit could be docker not fingerprinting. The operator still gets a systemctl status nomad dump, which is diagnosable. Minor diagnostic gap, not a correctness issue.

Tests

All six dry-run string assertions updated to the new Step 1 label. Consistent and correct.

Verdict

APPROVE — Docker install + driver health poll correctly fixes the fresh-LXC placement failure; idempotency and error-path behaviour are sound


Reviewed at b77bae9 | AGENTS.md

## AI Review <!-- reviewed: b77bae9c2a9bb305af84cea5a8cb7888ec01495f --> ### Summary **Root cause / fix match:** Issue #871 describes a "missing drivers" placement failure on fresh LXC boxes because `docker` is not pre-installed and Nomad's docker fingerprinter reports `Healthy=false`. The PR fixes both legs: `install.sh` installs `docker.io` and starts `docker.service`, and `cluster-up.sh` adds a third poll gate (`nomad_docker_driver_healthy`) to Step 8 so the orchestrator waits until Nomad itself confirms the driver is live. Diagnosis-to-fix match is tight. ### install.sh - **Idempotency:** sentinel is `command -v docker`; `systemctl enable --now` is a documented no-op on an already-enabled+active unit. Both are correct for the binary-presence model. - **Deliberate start-on-install:** the "install-but-don't-start" convention applies to disinto-owned units (nomad, vault). Docker is a third-party dependency that must be running *before* Nomad fingerprints it. The comment in the header explains this clearly and explicitly — not an oversight. - **Conditional apt-get update:** when nomad+vault are already at pin (`need_pkgs` empty) but docker is absent, the HashiCorp block (which runs `apt-get update`) is skipped and `apt-get install docker.io` runs against whatever cache age is present. On a genuinely stale cache this would fail — but loudly with a clear apt error, not silently. Given that `docker.io` is in the base Ubuntu sources and this scenario (nomad+vault present, docker absent) is already an unusual state, the tradeoff is acceptable. - **`--no-install-recommends` absent for docker.io:** minor style inconsistency with the nomad/vault install line; no correctness impact on a dev box. ### cluster-up.sh - **`nomad_docker_driver_healthy`:** JSON path `.Drivers.docker.Detected` / `.Drivers.docker.Healthy` matches the standard Nomad node-status schema. `// false` fallback handles the "no docker key at all" case correctly. `|| true` on the `nomad node status` call is the right pattern (mirrors `nomad_ready_count`). - **Step 8 three-condition fast-path:** all three conditions must be true to skip; any false triggers the else branch, which conditionally starts nomad only if inactive, then polls both health conditions in sequence. Logic is correct and mirrors the vault step pattern. - **Error attribution:** `poll_until_healthy nomad nomad_docker_driver_healthy` names the service as "nomad" in the timeout message, but the real culprit could be docker not fingerprinting. The operator still gets a `systemctl status nomad` dump, which is diagnosable. Minor diagnostic gap, not a correctness issue. ### Tests All six dry-run string assertions updated to the new Step 1 label. Consistent and correct. ### Verdict **APPROVE** — Docker install + driver health poll correctly fixes the fresh-LXC placement failure; idempotency and error-path behaviour are sound --- *Reviewed at `b77bae9` | [AGENTS.md](AGENTS.md)*
review-bot approved these changes 2026-04-16 14:19:24 +00:00
review-bot left a comment
Collaborator

AI Review: APPROVE — Docker install + driver health poll correctly fixes the fresh-LXC placement failure; idempotency and error-path behaviour are sound

AI Review: **APPROVE** — Docker install + driver health poll correctly fixes the fresh-LXC placement failure; idempotency and error-path behaviour are sound
dev-qwen merged commit a34a478a8e into main 2026-04-16 14:19:45 +00:00
dev-qwen deleted branch fix/issue-871 2026-04-16 14:19:45 +00:00
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: disinto-admin/disinto#876
No description provided.