# Sprint: versioned agent images

## Vision issues

- #429 — feat: publish versioned agent images — compose should use `image:` not `build:`

## What this enables

After this sprint, `disinto init` produces a `docker-compose.yml` that pulls a pinned image from a registry instead of building from source. A new factory instance needs only a token and a config file — no clone, no build, no local Docker context. This closes the gap between "works on my machine" and "one-command bootstrap."

It also enables rollback: if agents misbehave after an upgrade, `AGENTS_IMAGE=v0.1.1 disinto up` restores the previous version without touching the codebase.

## What exists today

The release pipeline is more complete than it looks:

- `formulas/release.toml` — 7-step release formula. Steps 4-5 already build and tag the image locally (`docker compose build --no-cache agents`, `docker tag disinto-agents disinto-agents:$RELEASE_VERSION`). The gap: no push step, no registry target.
- `lib/release.sh` — creates the vault TOML and ops repo PR for the release. No image version wired into compose generation.
- `lib/generators.sh` `_generate_compose_impl()` — generates compose with `build:` (`context: .`, `dockerfile: docker/agents/Dockerfile`) for agents, runner, reproduce, and edge. Version-unaware.
- `vault/vault-env.sh` — `DOCKER_HUB_TOKEN` is in `VAULT_ALLOWED_SECRETS`. Not currently used.
- `docker/agents/Dockerfile` — no VOLUME declarations; runtime state, repos, and config are mounted via compose but not declared. Claude binary injected by compose at init time.
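The `image:` vs `build:` split can be sketched as a small shell fragment. This is a hypothetical simplification of what `_generate_compose_impl` might emit, not the real generator; the registry path `ghcr.io/OWNER/disinto-agents` is an assumption.

```shell
#!/bin/sh
# Sketch: emit either a pinned image or a local build block for the
# agents service, depending on whether AGENTS_IMAGE is set.
# NOTE: the registry path ghcr.io/OWNER/disinto-agents is an assumption.
emit_agents_service() {
  if [ -n "${AGENTS_IMAGE:-}" ]; then
    # Pinned deployment: pull a versioned image from the registry.
    printf '  agents:\n    image: ghcr.io/OWNER/disinto-agents:%s\n' "$AGENTS_IMAGE"
  else
    # Dev mode: fall back to building from the local source tree.
    printf '  agents:\n    build:\n      context: .\n      dockerfile: docker/agents/Dockerfile\n'
  fi
}

# Prints the dev-mode build: block when AGENTS_IMAGE is unset.
emit_agents_service
```

The same branch would apply to the runner, reproduce, and edge services, since they share the image.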
## Complexity

Files touched: 4

- `formulas/release.toml` — add a `push-image` step (after tag-image, before restart-agents)
- `lib/generators.sh` — `_generate_compose_impl()` reads the `AGENTS_IMAGE` env var; emits `image:` when set, falls back to `build:` when not set (dev mode)
- `docker/agents/Dockerfile` — add explicit VOLUME declarations for /home/agent/data, /home/agent/repos, /home/agent/disinto/projects, /home/agent/disinto/state
- `bin/disinto` `disinto_up()` — pass `AGENTS_IMAGE` through to compose if set in `.env`

Subsystems: release formula, compose generation, Dockerfile hygiene

Sub-issues: 3

Gluecode ratio: ~80% gluecode (release step, VOLUME declarations), ~20% new (the `AGENTS_IMAGE` env var path)

## Risks

- Registry credentials: `DOCKER_HUB_TOKEN` is in the vault allowlist but not wired up. The push step needs a registry login — either Docker Hub (`DOCKER_HUB_TOKEN`) or GHCR (`GITHUB_TOKEN`, already in the vault). The sprint spec must pick one and add the credential to the release vault TOML.
- Volume shadowing: if the VOLUME declarations don't match the compose volume mounts exactly, runtime files land in anonymous volumes instead of named ones. Must test before shipping.
- Existing deployments: currently on `build:`. Migration: set `AGENTS_IMAGE` in `.env`, re-run `disinto init` (compose is regenerated), restart. No SSH, no worktree needed.
- `runner` service: same image as agents, same version. The runner service must be updated in compose generation too.

## Cost — new infra to maintain

- Registry account + token rotation: one vault secret (`DOCKER_HUB_TOKEN`) needs a rotation policy. GHCR (via `GITHUB_TOKEN`) needs no additional account but ties releases to GitHub.
- The release formula grows from 7 to 8 steps. Small maintenance surface.
- `AGENTS_IMAGE` becomes a documented env var in `.env` for pinned deployments. Needs docs.

## Recommendation

Worth it. The release formula is 90% done — one push step closes the gap.
The compose generation change is purely additive (`AGENTS_IMAGE` env var, fallback to `build:` for dev). Volume declarations are hygiene that should exist regardless of versioning.

Pick GHCR over Docker Hub: `GITHUB_TOKEN` is already in the vault allowlist and the ops repo. No new account needed.

## Side effects of this sprint

Beyond versioned images, this sprint indirectly closes one open bug:

- **#665 (edge cold-start race)** — `disinto-edge` currently exits with code 128 on a cold `disinto up` because its entrypoint clones from `forgejo:3000` before forgejo's HTTP listener is up. Once edge's image embeds the disinto source at build time (no runtime clone), the race vanishes. The `depends_on: { forgejo: { condition: service_healthy } }` workaround proposed in #665 becomes unnecessary.

Worth flagging explicitly so a dev bot working on #665 doesn't apply that workaround in parallel — it would be churn this sprint deletes anyway.

## What this sprint does not yet enable

This sprint delivers versioned images and pinned compose. It is a foundation, not the whole client-box upgrade story. Four follow-up sprints complete the picture for harb-style client boxes — each independently scopable, with the dependency chain noted.

### Follow-up A: `disinto upgrade <version>` subcommand

**Why**: even with versioned images, an operator on a client box still has to coordinate multiple steps to upgrade — `git fetch && git checkout`, edit `.env` to set `AGENTS_IMAGE`, re-run `_generate_compose_impl`, `docker compose pull`, `docker compose up -d --force-recreate`, plus any out-of-band migrations. There is no single atomic command. Without one, "upgrade harb to v0.3.0" stays a multi-step human operation that drifts out of sync.

**Shape**:

```
disinto upgrade v0.3.0
```

Sequence (roughly):

1. `git fetch --tags` and verify the tag exists
2. Bail if the working tree is dirty
3. `git checkout v0.3.0`
4. `_env_set_idempotent AGENTS_IMAGE v0.3.0 .env` (helper from #641)
5. Re-run `_generate_compose_impl` (picks up the new image tag)
6. Run pre-upgrade migration hooks (Follow-up C)
7. `docker compose pull && docker compose up -d --force-recreate`
8. Run post-upgrade migration hooks
9. Health check; roll back to the previous version on failure
10. Log the result

**Files touched**: `bin/disinto` (~150 lines, a new `disinto_upgrade()` function), possibly extracted to a new `lib/upgrade.sh` if it grows large enough to warrant separation.

**Dependency**: this sprint (needs `AGENTS_IMAGE` to be a real thing in `.env` and in the compose generator).

### Follow-up B: unify `DISINTO_VERSION` and `AGENTS_IMAGE`

**Why**: today there are two version concepts in the codebase:

- `DISINTO_VERSION` — used at `docker/edge/entrypoint-edge.sh:84` for the in-container source clone (`git clone --branch ${DISINTO_VERSION:-main}`). Defaults to `main`. Also set in the compose generator at `lib/generators.sh:397` for the edge service.
- `AGENTS_IMAGE` — proposed by this sprint for the Docker image tag in compose.

These should be **the same value**. If you are running the `v0.3.0` agents image, the in-container source (if any clone still happens) should also be at `v0.3.0`. Otherwise you get a v0.3.0 binary running against v-something-else source, which is exactly the silent drift versioning is meant to prevent. After this sprint folds source into the image, `DISINTO_VERSION` in containers becomes vestigial.

The follow-up: pick one name (probably keep `DISINTO_VERSION`, since it is referenced in more places), have `_generate_compose_impl` set both `image:` and the env var from the same source, and delete the redundant runtime clone block in `entrypoint-edge.sh`.

**Files touched**: `lib/generators.sh`, `docker/edge/entrypoint-edge.sh` (delete the runtime clone block once the image carries source), possibly `lib/env.sh` for the default value.

**Dependency**: this sprint.
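A minimal sketch of the unification: one variable drives both the compose image tag and the in-container env var, so they cannot drift apart. The `emit_edge_service` helper, the registry path, and the exact YAML shape are assumptions, not the real generator output.

```shell
#!/bin/sh
# Sketch for Follow-up B: DISINTO_VERSION is the single source of truth
# for both the image tag and the env var injected into the container.
# NOTE: ghcr.io/OWNER/disinto-edge and the YAML shape are assumptions.
emit_edge_service() {
  version="${DISINTO_VERSION:-main}"
  printf '  edge:\n'
  printf '    image: ghcr.io/OWNER/disinto-edge:%s\n' "$version"
  printf '    environment:\n'
  printf '      DISINTO_VERSION: "%s"\n' "$version"
}
```

Because both lines read the same `version` variable, a v0.3.0 image can never be paired with a v-something-else source reference.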
### Follow-up C: migration framework for breaking changes

**Why**: some upgrades have side effects beyond "new code in the container":

- The CLAUDE_CONFIG_DIR migration (#641 → `setup_claude_config_dir` in `lib/claude-config.sh`) needs a one-time `mkdir + mv + symlink` per host.
- The credential-helper cleanup (#669; #671 for the safety-net repair) needs in-volume URL repair.
- Future: schema changes in the vault, ops repo restructures, env var renames.

There is no `disinto/migrations/v0.3.0.sh`-style framework. Existing migrations live ad hoc inside `disinto init` and run unconditionally on init. That works for fresh installs but not for "I'm upgrading from v0.2.0 to v0.3.0 and need migrations v0.2.1 → v0.2.2 → v0.3.0 to run in order".

**Shape**: a `migrations/` directory with one file per version (`v0.3.0.sh`, `v0.3.1.sh`, …). `disinto upgrade` (Follow-up A) invokes each migration file in order between the previous applied version and the target. Track the applied version in `.env` (e.g. `DISINTO_LAST_MIGRATION=v0.3.0`) or in `state/`. Standard rails/django/flyway pattern. The framework itself is small; the value is in having a place for migrations to live so they are not scattered through `disinto init` and lost in code review.

**Files touched**: `lib/upgrade.sh`, a new `migrations/` directory, and a tracking key in `.env` for the last applied migration version.

**Dependency**: Follow-up A (the upgrade command is the natural caller).

### Follow-up D: bootstrap-from-broken-state runbook

**Why**: this sprint and Follow-ups A–C describe the steady-state upgrade flow. But existing client boxes — harb-dev-box specifically — are not in steady state. harb's working tree is at tag `v0.2.0` (months behind main). Its containers are running locally built `:latest` images of unknown vintage. Some host-level state (`CLAUDE_CONFIG_DIR`, the `~/.git/config` credential helper from the disinto-dev-box rollout) has not been applied on harb yet.
The clean upgrade flow cannot reach harb from where it currently is — there is too much drift. Each existing client box needs a **one-time manual reset** to a known-good baseline before the versioned upgrade flow takes over. The reset is mechanical but not automatable — it touches host-level state that pre-dates the new flow.

**Shape**: a documented runbook at `docs/client-box-bootstrap.md` (or similar) that walks operators through the one-time reset:

1. `disinto down`
2. `git fetch --all && git checkout <tag>` on the working tree
3. Apply host-level migrations:
   - `setup_claude_config_dir true` (from `lib/claude-config.sh`, added in #641)
   - Strip embedded creds from `.git/config`'s forgejo remote and add the inline credential helper using the pattern from #669
   - Rotate `FORGE_PASS` and `FORGE_TOKEN` if they have leaked (separate decision)
4. Rebuild images (`docker compose build`) or pull from the registry once this sprint lands
5. `disinto up`
6. Verify with `disinto status` and a smoke fetch through the credential helper

After the reset, the box is at a known-good baseline and `disinto upgrade <version>` takes over for all subsequent upgrades. The runbook documents this as the only manual operation an operator should ever have to perform on a client box.

**Files touched**: new `docs/client-box-bootstrap.md`. Optionally a small change to `disinto init` to detect "this looks like a stale-state box that needs the reset runbook, not a fresh init" and refuse with a pointer to the runbook.

**Dependency**: none (can be done in parallel with this sprint and the others).

## Updated recommendation

The original recommendation stands: this sprint is worth it, ~80% gluecode, GHCR over Docker Hub. Layered on top:

- **Sequence the four follow-ups**: A (upgrade subcommand) and D (bootstrap runbook) are independent of this sprint's image work and can land in parallel. B (version unification) is small cleanup that depends on this sprint.
C (migration framework) can wait until the first migration that actually needs it — `setup_claude_config_dir` doesn't, since it already lives in `disinto init`.
- **Do not fix #665 in parallel**: as noted in "Side effects", this sprint deletes the cause. A `depends_on: service_healthy` workaround applied to edge in parallel would be wasted work.
- **Do not file separate forge issues for the follow-ups until this sprint is broken into sub-issues**: keep them in this document until the architect (or the operator) is ready to commit to a sequence. That avoids backlog clutter and lets the four follow-ups stay reorderable as the sprint shape evolves.
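One concrete piece from Follow-up D, the optional `disinto init` stale-state guard, could look roughly like this. The markers checked (an existing `.env` plus a generated `docker-compose.yml`) are assumptions, not the real detection logic.

```shell
#!/bin/sh
# Sketch: refuse a fresh `disinto init` on a box that already carries
# deployment state, and point the operator at the bootstrap runbook.
# NOTE: the markers (.env + docker-compose.yml) are assumed, not real.
looks_like_stale_box() {
  [ -f .env ] && [ -f docker-compose.yml ]
}

refuse_if_stale() {
  if looks_like_stale_box; then
    echo "error: existing deployment state detected." >&2
    echo "run the one-time reset runbook instead of a fresh init:" >&2
    echo "  docs/client-box-bootstrap.md" >&2
    return 1
  fi
}
```

`disinto init` would call `refuse_if_stale` before touching anything, making the runbook the explicit escape hatch rather than an undocumented recovery path.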