Sprint: versioned agent images
Vision issues
- #429 — feat: publish versioned agent images — compose should use image: not build:
What this enables
After this sprint, disinto init produces a docker-compose.yml that pulls a pinned image
from a registry instead of building from source. A new factory instance needs only a token
and a config file — no clone, no build, no local Docker context. This closes the gap between
"works on my machine" and "one-command bootstrap."
It also enables rollback: if agents misbehave after an upgrade, AGENTS_IMAGE=v0.1.1 disinto up
restores the previous version without touching the codebase.
What exists today
The release pipeline is more complete than it looks:
- `formulas/release.toml` — 7-step release formula. Steps 4-5 already build and tag the image locally (`docker compose build --no-cache agents`, `docker tag disinto-agents disinto-agents:$RELEASE_VERSION`). The gap: no push step, no registry target.
- `lib/release.sh` — creates vault TOML and ops repo PR for the release. No image version wired into compose generation.
- `lib/generators.sh` — `_generate_compose_impl()` generates compose with `build: { context: ., dockerfile: docker/agents/Dockerfile }` for agents, runner, reproduce, edge. Version-unaware.
- `vault/vault-env.sh` — `DOCKER_HUB_TOKEN` is in `VAULT_ALLOWED_SECRETS`. Not currently used.
- `docker/agents/Dockerfile` — no VOLUME declarations; runtime state, repos, and config are mounted via compose but not declared. Claude binary injected by compose at init time.
Complexity
Files touched: 4
- `formulas/release.toml` — add `push-image` step (after tag-image, before restart-agents)
- `lib/generators.sh` — `_generate_compose_impl()` reads `AGENTS_IMAGE` env var; emits `image:` when set, falls back to `build:` when not set (dev mode)
- `docker/agents/Dockerfile` — add explicit VOLUME declarations for /home/agent/data, /home/agent/repos, /home/agent/disinto/projects, /home/agent/disinto/state
- `bin/disinto` — `disinto_up()` passes `AGENTS_IMAGE` through to compose if set in `.env`
Subsystems: release formula, compose generation, Dockerfile hygiene
Sub-issues: 3
Gluecode ratio: ~80% gluecode (release step, VOLUME declarations), ~20% new (`AGENTS_IMAGE` env var path)
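The VOLUME hygiene amounts to a few declarations in `docker/agents/Dockerfile`; a sketch, using the four paths listed above:

```dockerfile
# Declare the mount points compose already uses, so runtime state written
# to these paths can never silently land in the image's writable layer.
VOLUME ["/home/agent/data", \
        "/home/agent/repos", \
        "/home/agent/disinto/projects", \
        "/home/agent/disinto/state"]
```

The declared paths must match the compose volume mounts exactly; any mismatch produces the anonymous-volume shadowing called out under Risks.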
Risks
- Registry credentials: `DOCKER_HUB_TOKEN` is in the vault allowlist but not wired up. The push step needs a registry login — either Docker Hub (`DOCKER_HUB_TOKEN`) or GHCR (`GITHUB_TOKEN`, already in vault). The sprint spec must pick one and add the credential to the release vault TOML.
- Volume shadow: if VOLUME declarations don't match the compose volume mounts exactly, runtime files land in anonymous volumes instead of named ones. Must test before shipping.
- Existing deployments: currently on `build:`. Migration: set `AGENTS_IMAGE` in `.env`, re-run `disinto init` (compose is regenerated), restart. No SSH, no worktree needed.
- `runner` service: same image as agents, same version. Must update the runner service in compose gen too.
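The volume-shadow risk can be smoke-tested after bringup. Anonymous volumes get 64-hex-character names, so any such name among a container's mounts means a VOLUME declaration didn't line up with a compose mount. A sketch (the container name `disinto-agents` is an assumption):

```shell
# True when a volume name looks like an anonymous (auto-generated) volume.
is_anonymous_volume() { printf '%s' "$1" | grep -qEx '[0-9a-f]{64}'; }

# List the agents container's volume names and warn on anonymous ones.
check_volume_shadow() {
  docker inspect disinto-agents \
    --format '{{range .Mounts}}{{.Name}}{{"\n"}}{{end}}' |
  while read -r name; do
    [ -n "$name" ] || continue
    if is_anonymous_volume "$name"; then
      echo "WARNING: anonymous volume ${name} (VOLUME/compose mismatch)" >&2
    fi
  done
}
```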
Cost — new infra to maintain
- Registry account + token rotation: one vault secret (DOCKER_HUB_TOKEN) needs rotation policy. GHCR (via GITHUB_TOKEN) has no additional account but ties release to GitHub.
- Release formula grows from 7 to 8 steps. Small maintenance surface.
- `AGENTS_IMAGE` becomes a documented env var in `.env` for pinned deployments. Needs docs.
Recommendation
Worth it. The release formula is 90% done — one push step closes the gap. The compose generation change is purely additive (AGENTS_IMAGE env var, fallback to build: for dev). Volume declarations are hygiene that should exist regardless of versioning.
Pick GHCR over Docker Hub: GITHUB_TOKEN is already in the vault allowlist and ops repo. No new account needed.
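For the GHCR variant, the push step reduces to a login, re-tag, and push. A sketch, where `ghcr.io/OWNER` is a placeholder registry path, `GITHUB_ACTOR` is an assumed username variable, and only `GITHUB_TOKEN` and `RELEASE_VERSION` come from the existing pipeline:

```shell
# Build the remote image reference from a release version.
ghcr_image_ref() {
  printf 'ghcr.io/OWNER/disinto-agents:%s\n' "$1"
}

# Sketch of the push-image formula step (GHCR variant).
push_image() {
  local ref
  ref="$(ghcr_image_ref "$RELEASE_VERSION")"
  echo "$GITHUB_TOKEN" | docker login ghcr.io -u "$GITHUB_ACTOR" --password-stdin
  docker tag "disinto-agents:$RELEASE_VERSION" "$ref"
  docker push "$ref"
}
```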
Side effects of this sprint
Beyond versioned images, this sprint indirectly closes one open bug:
- #665 (edge cold-start race) — `disinto-edge` currently exits with code 128 on a cold `disinto up` because its entrypoint clones from `forgejo:3000` before forgejo's HTTP listener is up. Once edge's image embeds the disinto source at build time (no runtime clone), the race vanishes. The `depends_on: { forgejo: { condition: service_healthy } }` workaround proposed in #665 becomes unnecessary.

Worth flagging explicitly so a dev bot working on #665 doesn't apply that workaround in parallel — it would be churn this sprint deletes anyway.
What this sprint does not yet enable
This sprint delivers versioned images and pinned compose. It is a foundation, not the whole client-box upgrade story. Four follow-up sprints complete the picture for harb-style client boxes — each independently scopable, with the dependency chain noted.
Follow-up A: disinto upgrade <version> subcommand
Why: even with versioned images, an operator on a client box still has to coordinate
multiple steps to upgrade — git fetch && git checkout, edit .env to set
AGENTS_IMAGE, re-run _generate_compose_impl, docker compose pull,
docker compose up -d --force-recreate, plus any out-of-band migrations. There is no
single atomic command. Without one, "upgrade harb to v0.3.0" stays a multi-step human
operation that drifts out of sync.
Shape:
disinto upgrade v0.3.0
Sequence (roughly):
- `git fetch --tags` and verify the tag exists
- Bail if the working tree is dirty
- `git checkout v0.3.0`
- `_env_set_idempotent AGENTS_IMAGE v0.3.0 .env` (helper from #641)
- Re-run `_generate_compose_impl` (picks up the new image tag)
- Run pre-upgrade migration hooks (Follow-up C)
- `docker compose pull && docker compose up -d --force-recreate`
- Run post-upgrade migration hooks
- Health check; rollback to previous version on failure
- Log result
Files touched: bin/disinto (~150 lines, new disinto_upgrade() function), possibly
extracted to a new lib/upgrade.sh if it grows large enough to warrant separation.
Dependency: this sprint (needs AGENTS_IMAGE to be a real thing in .env and in the
compose generator).
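The sequence above could be sketched roughly as follows. `_env_set_idempotent` and `_generate_compose_impl` are the helpers this plan already names; the migration and health-check hooks (`run_migrations`, `disinto_health_check`) are placeholders for Follow-up C, not existing functions:

```shell
# Sketch of disinto_upgrade() — illustrative only, not the implementation.
disinto_upgrade() {
  local target="$1"
  [ -n "$target" ] || { echo "usage: disinto upgrade <version>" >&2; return 2; }

  git fetch --tags
  git rev-parse -q --verify "refs/tags/${target}" >/dev/null ||
    { echo "unknown tag: ${target}" >&2; return 1; }
  [ -z "$(git status --porcelain)" ] ||
    { echo "working tree dirty; refusing to upgrade" >&2; return 1; }

  # Remember the current pin so failure can roll it back.
  local previous
  previous="$(sed -n 's/^AGENTS_IMAGE=//p' .env 2>/dev/null | head -n1)"

  git checkout "$target"
  _env_set_idempotent AGENTS_IMAGE "$target" .env
  _generate_compose_impl                  # regenerates compose with the new tag

  run_migrations pre "$target"            # Follow-up C hook (placeholder)
  docker compose pull && docker compose up -d --force-recreate
  run_migrations post "$target"

  if ! disinto_health_check; then         # placeholder health probe
    echo "upgrade unhealthy; re-pinning AGENTS_IMAGE=${previous}" >&2
    _env_set_idempotent AGENTS_IMAGE "$previous" .env
    return 1
  fi
  echo "upgraded to ${target}"
}
```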
Follow-up B: unify DISINTO_VERSION and AGENTS_IMAGE
Why: today there are two version concepts in the codebase:
- `DISINTO_VERSION` — used at `docker/edge/entrypoint-edge.sh:84` for the in-container source clone (`git clone --branch ${DISINTO_VERSION:-main}`). Defaults to `main`. Also set in the compose generator at `lib/generators.sh:397` for the edge service.
- `AGENTS_IMAGE` — proposed by this sprint for the docker image tag in compose.
These should be the same value. If you are running the v0.3.0 agents image, the
in-container source (if any clone still happens) should also be at v0.3.0. Otherwise
you get a v0.3.0 binary running against v-something-else source, which is exactly the
silent drift versioning is meant to prevent.
After this sprint folds source into the image, DISINTO_VERSION in containers becomes
vestigial. The follow-up: pick one name (probably keep DISINTO_VERSION since it is
referenced in more places), have _generate_compose_impl set both image: and the env
var from the same source, and delete the redundant runtime clone block in
entrypoint-edge.sh.
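One possible shape for the unified generator: derive both the `image:` tag and the in-container version env var from the same variable, so they can never disagree. A sketch, with `ghcr.io/OWNER` as a placeholder registry path and the function name illustrative:

```shell
# Emit the agents service fragment. When AGENTS_IMAGE is set, pin the
# image and pass the same value as DISINTO_VERSION; otherwise fall back
# to build: for dev mode.
emit_agents_service() {
  local version="${AGENTS_IMAGE:-}"
  if [ -n "$version" ]; then
    cat <<EOF
  agents:
    image: ghcr.io/OWNER/disinto-agents:${version}
    environment:
      - DISINTO_VERSION=${version}
EOF
  else
    cat <<EOF
  agents:
    build:
      context: .
      dockerfile: docker/agents/Dockerfile
EOF
  fi
}
```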
Files touched: lib/generators.sh, docker/edge/entrypoint-edge.sh (delete the
runtime clone block once the image carries source), possibly lib/env.sh for the
default value.
Dependency: this sprint.
Follow-up C: migration framework for breaking changes
Why: some upgrades have side effects beyond "new code in the container":
- The CLAUDE_CONFIG_DIR migration (#641 → `setup_claude_config_dir` in `lib/claude-config.sh`) needs a one-time `mkdir + mv + symlink` per host.
- The credential-helper cleanup (#669; #671 for the safety-net repair) needs in-volume URL repair.
- Future: schema changes in the vault, ops repo restructures, env var renames.
There is no disinto/migrations/v0.3.0.sh style framework. Existing migrations live
ad-hoc inside disinto init and run unconditionally on init. That works for fresh
installs but not for "I'm upgrading from v0.2.0 to v0.3.0 and need migrations
v0.2.1 → v0.2.2 → v0.3.0 to run in order".
Shape: a migrations/ directory with one file per version (v0.3.0.sh,
v0.3.1.sh, …). disinto upgrade (Follow-up A) invokes each migration file in order
between the previous applied version and the target. Track the applied version in
.env (e.g. DISINTO_LAST_MIGRATION=v0.3.0) or in state/. Standard
rails/django/flyway pattern. The framework itself is small; the value is in having a
place for migrations to live so they are not scattered through disinto init and lost
in code review.
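The runner itself is small. A sketch under the assumptions above (one `migrations/vX.Y.Z.sh` per version, `DISINTO_LAST_MIGRATION` as the tracking key; `sort -V` supplies version ordering):

```shell
# Run every migrations/vX.Y.Z.sh strictly after the last applied version
# and up to (and including) the target, in version order.
run_pending_migrations() {
  local target="$1" last="${DISINTO_LAST_MIGRATION:-v0.0.0}"
  local f ver
  for f in migrations/v*.sh; do
    [ -e "$f" ] || continue               # no migrations directory yet
    ver="$(basename "$f" .sh)"
    # ver is pending iff last < ver <= target under version sort.
    if [ "$(printf '%s\n%s\n' "$last" "$ver" | sort -V | head -n1)" = "$last" ] &&
       [ "$ver" != "$last" ] &&
       [ "$(printf '%s\n%s\n' "$ver" "$target" | sort -V | head -n1)" = "$ver" ]; then
      echo "applying migration ${ver}"
      sh "$f" || return 1
      last="$ver"   # in the real thing, persist this back to .env as well
    fi
  done
}
```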
Files touched: lib/upgrade.sh (the upgrade command is the natural caller), new
migrations/ directory, a tracking key in .env for the last applied migration
version.
Dependency: Follow-up A (the upgrade command is the natural caller).
Follow-up D: bootstrap-from-broken-state runbook
Why: this sprint and Follow-ups A–C describe the steady-state upgrade flow. But
existing client boxes — harb-dev-box specifically — are not in steady state. harb's
working tree is at tag v0.2.0 (months behind main). Its containers are running locally
built :latest images of unknown vintage. Some host-level state (CLAUDE_CONFIG_DIR,
~/.git/config credential helper from the disinto-dev-box rollout) has not been applied
on harb yet. The clean upgrade flow cannot reach harb from where it currently is — there
is too much drift.
Each existing client box needs a one-time manual reset to a known-good baseline before the versioned upgrade flow takes over. The reset is mechanical but not automatable — it touches host-level state that pre-dates the new flow.
Shape: a documented runbook at docs/client-box-bootstrap.md (or similar) that
walks operators through the one-time reset:
- `disinto down`
- `git fetch --all && git checkout <latest tag>` on the working tree
- Apply host-level migrations: `setup_claude_config_dir true` (from `lib/claude-config.sh`, added in #641)
- Strip embedded creds from `.git/config`'s forgejo remote and add the inline credential helper using the pattern from #669
- Rotate `FORGE_PASS` and `FORGE_TOKEN` if they have leaked (separate decision)
- Rebuild images (`docker compose build`) or pull from registry once this sprint lands
- `disinto up`
- Verify with `disinto status` and a smoke fetch through the credential helper
After the reset, the box is in a known-good baseline and disinto upgrade <version>
takes over for all subsequent upgrades. The runbook documents this as the only manual
operation an operator should ever have to perform on a client box.
Files touched: new docs/client-box-bootstrap.md. Optionally a small change to
disinto init to detect "this looks like a stale-state box that needs the reset
runbook, not a fresh init" and refuse with a pointer to the runbook.
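The detection heuristic could be as simple as checking for a pre-existing `.env` that predates versioned images. A sketch, with the heuristic itself an assumption:

```shell
# Hypothetical guard for disinto init: a box with an existing .env that has
# no AGENTS_IMAGE pin looks like a stale pre-versioning deployment, which
# needs the bootstrap runbook rather than a fresh init.
refuse_if_stale_box() {
  if [ -f .env ] && ! grep -q '^AGENTS_IMAGE=' .env; then
    echo "stale pre-versioning box detected; see docs/client-box-bootstrap.md" >&2
    return 1
  fi
}
```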
Dependency: none (can be done in parallel with this sprint and the others).
Updated recommendation
The original recommendation stands: this sprint is worth it, ~80% gluecode, GHCR over Docker Hub. Layered on top:
- Sequence the four follow-ups: D (bootstrap runbook) is independent of this sprint's image work and can land in parallel. A (upgrade subcommand) and B (version unification) both depend on this sprint — A needs `AGENTS_IMAGE` to exist in `.env` and the compose generator, B is small cleanup on top of it. C (migration framework) can wait until the first migration that actually needs it — `setup_claude_config_dir` doesn't, since it already lives in `disinto init`.
- Do not fix #665 in parallel: as noted in "Side effects", this sprint deletes the cause. A `depends_on: service_healthy` workaround applied to edge in parallel would be wasted work.
- Do not file separate forge issues for the follow-ups until this sprint is broken into sub-issues: keep them in this document until the architect (or the operator) is ready to commit to a sequence. That avoids backlog clutter and lets the four follow-ups stay reorderable as the sprint shape evolves.