disinto-ops/sprints/versioned-agent-images.md
dev-bot 3a172bcc86 sprint(versioned-agent-images): add side-effects, four follow-up sprints, updated recommendation
Enriches the architect's existing sprint plan with:

1. Side effects: this sprint indirectly closes #665 (edge cold-start race) by
   removing the runtime clone — flagging so a parallel #665 fix isn't applied.

2. Four follow-up sprints that complete the client-box upgrade story:
   - A: 'disinto upgrade <version>' subcommand for atomic client-side upgrades
   - B: unify DISINTO_VERSION and AGENTS_IMAGE into one version concept
   - C: migration framework for breaking changes (per-version migration files)
   - D: bootstrap-from-broken-state runbook for existing drifted boxes (harb)

3. Updated recommendation that sequences the follow-ups against this sprint
   and notes #665 should not be fixed in parallel.

The original sprint scope (4 files, ~80% gluecode, GHCR) is unchanged and
remains tightly scoped. The follow-ups are deliberately kept inside this
document rather than filed as separate forge issues until the sprint plan is
ready to be broken into sub-issues by the architect.
2026-04-11 10:09:57 +00:00


Sprint: versioned agent images

Vision issues

  • #429 — feat: publish versioned agent images — compose should use image: not build:

What this enables

After this sprint, disinto init produces a docker-compose.yml that pulls a pinned image from a registry instead of building from source. A new factory instance needs only a token and a config file — no clone, no build, no local Docker context. This closes the gap between "works on my machine" and "one-command bootstrap."

It also enables rollback: if agents misbehave after an upgrade, AGENTS_IMAGE=v0.1.1 disinto up restores the previous version without touching the codebase.

What exists today

The release pipeline is more complete than it looks:

  • formulas/release.toml — 7-step release formula. Steps 4-5 already build and tag the image locally (docker compose build --no-cache agents, docker tag disinto-agents disinto-agents:$RELEASE_VERSION). The gap: no push step, no registry target.
  • lib/release.sh — Creates vault TOML and ops repo PR for the release. No image version wired into compose generation.
  • lib/generators.sh _generate_compose_impl() — Generates compose with build: context: . dockerfile: docker/agents/Dockerfile for agents, runner, reproduce, edge. Version-unaware.
  • vault/vault-env.sh — DOCKER_HUB_TOKEN is in VAULT_ALLOWED_SECRETS. Not currently used.
  • docker/agents/Dockerfile — No VOLUME declarations; runtime state, repos, and config are mounted via compose but not declared. Claude binary injected by compose at init time.

Complexity

Files touched: 4

  • formulas/release.toml — add push-image step (after tag-image, before restart-agents)
  • lib/generators.sh — _generate_compose_impl() reads AGENTS_IMAGE env var; emits image: when set, falls back to build: when not set (dev mode)
  • docker/agents/Dockerfile — add explicit VOLUME declarations for /home/agent/data, /home/agent/repos, /home/agent/disinto/projects, /home/agent/disinto/state
  • bin/disinto — disinto_up() passes AGENTS_IMAGE through to compose if set in .env
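The image/build branch in the generator could look roughly like this. This is a sketch, not the real generator code: the function name `_emit_agents_service`, the `ghcr.io/disinto/disinto-agents` registry path, and the exact compose layout are all assumptions for illustration.

```shell
# Hypothetical sketch of the AGENTS_IMAGE branch inside _generate_compose_impl.
_emit_agents_service() {
  if [ -n "${AGENTS_IMAGE:-}" ]; then
    # Pinned deployment: pull the published image at the requested tag.
    cat <<EOF
  agents:
    image: ghcr.io/disinto/disinto-agents:${AGENTS_IMAGE}
EOF
  else
    # Dev mode: no AGENTS_IMAGE set, keep building from the local checkout.
    cat <<EOF
  agents:
    build:
      context: .
      dockerfile: docker/agents/Dockerfile
EOF
  fi
}
```

The fallback keeps `disinto init` on a fresh clone working with no registry access at all, which is what makes the change purely additive.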

Subsystems: release formula, compose generation, Dockerfile hygiene

Sub-issues: 3

Gluecode ratio: ~80% gluecode (release step, VOLUME declarations), ~20% new (AGENTS_IMAGE env var path)

Risks

  • Registry credentials: DOCKER_HUB_TOKEN is in vault allowlist but not wired up. The push step needs a registry login — either Docker Hub (DOCKER_HUB_TOKEN) or GHCR (GITHUB_TOKEN, already in vault). The sprint spec must pick one and add the credential to the release vault TOML.
  • Volume shadow: if VOLUME declarations don't match the compose volume mounts exactly, runtime files land in anonymous volumes instead of named ones. Must test before shipping.
  • Existing deployments: currently on build:. Migration: set AGENTS_IMAGE in .env, re-run disinto init (compose is regenerated), restart. No SSH, no worktree needed.
  • runner service: same image as agents, same version. Must update runner service in compose gen too.
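If GHCR is picked, the added formula step would be on the order of the fragment below. The `[step.*]` schema, step name, and `GITHUB_USER` variable are assumptions — the real formulas/release.toml layout was not inspected, only its step count.

```toml
# Sketch of the push-image step for formulas/release.toml (schema assumed).
[step.push-image]
after = "tag-image"
run = """
echo "$GITHUB_TOKEN" | docker login ghcr.io -u "$GITHUB_USER" --password-stdin
docker tag disinto-agents:$RELEASE_VERSION ghcr.io/disinto/disinto-agents:$RELEASE_VERSION
docker push ghcr.io/disinto/disinto-agents:$RELEASE_VERSION
"""
```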

Cost — new infra to maintain

  • Registry account + token rotation: one vault secret (DOCKER_HUB_TOKEN) needs rotation policy. GHCR (via GITHUB_TOKEN) has no additional account but ties release to GitHub.
  • Release formula grows from 7 to 8 steps. Small maintenance surface.
  • AGENTS_IMAGE becomes a documented env var in .env for pinned deployments. Needs docs.

Recommendation

Worth it. The release formula is 90% done — one push step closes the gap. The compose generation change is purely additive (AGENTS_IMAGE env var, fallback to build: for dev). Volume declarations are hygiene that should exist regardless of versioning.

Pick GHCR over Docker Hub: GITHUB_TOKEN is already in the vault allowlist and ops repo. No new account needed.

Side effects of this sprint

Beyond versioned images, this sprint indirectly closes one open bug:

  • #665 (edge cold-start race) — disinto-edge currently exits with code 128 on a cold disinto up because its entrypoint clones from forgejo:3000 before forgejo's HTTP listener is up. Once edge's image embeds the disinto source at build time (no runtime clone), the race vanishes. The depends_on: { forgejo: { condition: service_healthy } } workaround proposed in #665 becomes unnecessary.

    Worth flagging explicitly so a dev bot working on #665 doesn't apply that workaround in parallel — it would be churn this sprint deletes anyway.

What this sprint does not yet enable

This sprint delivers versioned images and pinned compose. It is a foundation, not the whole client-box upgrade story. Four follow-up sprints complete the picture for harb-style client boxes — each independently scopable, with the dependency chain noted.

Follow-up A: disinto upgrade <version> subcommand

Why: even with versioned images, an operator on a client box still has to coordinate multiple steps to upgrade — git fetch && git checkout, edit .env to set AGENTS_IMAGE, re-run _generate_compose_impl, docker compose pull, docker compose up -d --force-recreate, plus any out-of-band migrations. There is no single atomic command. Without one, "upgrade harb to v0.3.0" stays a multi-step human operation that drifts out of sync.

Shape:

disinto upgrade v0.3.0

Sequence (roughly):

  1. git fetch --tags and verify the tag exists
  2. Bail if the working tree is dirty
  3. git checkout v0.3.0
  4. _env_set_idempotent AGENTS_IMAGE v0.3.0 .env (helper from #641)
  5. Re-run _generate_compose_impl (picks up the new image tag)
  6. Run pre-upgrade migration hooks (Follow-up C)
  7. docker compose pull && docker compose up -d --force-recreate
  8. Run post-upgrade migration hooks
  9. Health check; rollback to previous version on failure
  10. Log result
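The sequence above can be sketched as a dry-run function that prints the commands it would issue. Helper names (`_env_set_idempotent`, `_generate_compose_impl`) come from the plan; actually executing git/docker, the migration hooks, health check, and rollback are elided.

```shell
# Dry-run sketch of disinto_upgrade(): emits the planned command sequence
# for a target version instead of executing it.
disinto_upgrade_plan() {
  version="$1"
  cat <<EOF
git fetch --tags
git rev-parse "refs/tags/${version}"  # bail if the tag does not exist
git diff --quiet                      # bail if the working tree is dirty
git checkout "${version}"
_env_set_idempotent AGENTS_IMAGE "${version}" .env
_generate_compose_impl                # regenerate compose with the new tag
# pre-upgrade migration hooks (Follow-up C) would run here
docker compose pull
docker compose up -d --force-recreate
# post-upgrade hooks, health check, rollback-on-failure, log result
EOF
}
```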

Files touched: bin/disinto (~150 lines, new disinto_upgrade() function), possibly extracted to a new lib/upgrade.sh if it grows large enough to warrant separation.

Dependency: this sprint (needs AGENTS_IMAGE to be a real thing in .env and in the compose generator).

Follow-up B: unify DISINTO_VERSION and AGENTS_IMAGE

Why: today there are two version concepts in the codebase:

  • DISINTO_VERSION — used at docker/edge/entrypoint-edge.sh:84 for the in-container source clone (git clone --branch ${DISINTO_VERSION:-main}). Defaults to main. Also set in the compose generator at lib/generators.sh:397 for the edge service.
  • AGENTS_IMAGE — proposed by this sprint for the docker image tag in compose.

These should be the same value. If you are running the v0.3.0 agents image, the in-container source (if any clone still happens) should also be at v0.3.0. Otherwise you get a v0.3.0 binary running against v-something-else source, which is exactly the silent drift versioning is meant to prevent.

After this sprint folds source into the image, DISINTO_VERSION in containers becomes vestigial. The follow-up: pick one name (probably keep DISINTO_VERSION since it is referenced in more places), have _generate_compose_impl set both image: and the env var from the same source, and delete the redundant runtime clone block in entrypoint-edge.sh.
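After unification, the generator would derive both the image tag and the container env var from the single value, along these lines. Function name and registry path are illustrative, not the real lib/generators.sh code.

```shell
# Sketch: one DISINTO_VERSION value drives both image: and the env var,
# so the binary and any in-container source can never diverge.
_emit_versioned_agents() {
  v="${DISINTO_VERSION:-main}"
  cat <<EOF
  agents:
    image: ghcr.io/disinto/disinto-agents:${v}
    environment:
      DISINTO_VERSION: ${v}
EOF
}
```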

Files touched: lib/generators.sh, docker/edge/entrypoint-edge.sh (delete the runtime clone block once the image carries source), possibly lib/env.sh for the default value.

Dependency: this sprint.

Follow-up C: migration framework for breaking changes

Why: some upgrades have side effects beyond "new code in the container":

  • The CLAUDE_CONFIG_DIR migration (#641 → setup_claude_config_dir in lib/claude-config.sh) needs a one-time mkdir + mv + symlink per host.
  • The credential-helper cleanup (#669; #671 for the safety-net repair) needs in-volume URL repair.
  • Future: schema changes in the vault, ops repo restructures, env var renames.

There is no disinto/migrations/v0.3.0.sh style framework. Existing migrations live ad-hoc inside disinto init and run unconditionally on init. That works for fresh installs but not for "I'm upgrading from v0.2.0 to v0.3.0 and need migrations v0.2.1 → v0.2.2 → v0.3.0 to run in order".

Shape: a migrations/ directory with one file per version (v0.3.0.sh, v0.3.1.sh, …). disinto upgrade (Follow-up A) invokes each migration file in order between the previous applied version and the target. Track the applied version in .env (e.g. DISINTO_LAST_MIGRATION=v0.3.0) or in state/. Standard rails/django/flyway pattern. The framework itself is small; the value is in having a place for migrations to live so they are not scattered through disinto init and lost in code review.
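The runner itself could be on the order of the sketch below, assuming one `<version>.sh` file per version and `sort -V` for version ordering. Persisting the marker back to .env is elided.

```shell
# Sketch of the migration runner: run every migrations/<version>.sh strictly
# after $last and up to (and including) $target, in version order.
run_migrations() {
  dir="$1" last="$2" target="$3"
  for f in $(ls "$dir" | sort -V); do
    v="${f%.sh}"
    # skip versions already applied (v <= last)
    [ "$(printf '%s\n%s\n' "$v" "$last" | sort -V | tail -n1)" = "$last" ] && continue
    # stop once past the target (v > target)
    [ "$(printf '%s\n%s\n' "$v" "$target" | sort -V | tail -n1)" = "$v" ] \
      && [ "$v" != "$target" ] && break
    sh "$dir/$f"
    last="$v"  # would be persisted as DISINTO_LAST_MIGRATION
  done
}
```

For an upgrade from v0.2.0 to v0.3.0 with files v0.1.0.sh through v0.4.0.sh present, only v0.2.1.sh … v0.3.0.sh run — the flyway-style contract described above.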

Files touched: lib/upgrade.sh, new migrations/ directory, a tracking key in .env for the last applied migration version.

Dependency: Follow-up A (the upgrade command is the natural caller).

Follow-up D: bootstrap-from-broken-state runbook

Why: this sprint and Follow-ups A–C describe the steady-state upgrade flow. But existing client boxes — harb-dev-box specifically — are not in steady state. harb's working tree is at tag v0.2.0 (months behind main). Its containers are running locally built :latest images of unknown vintage. Some host-level state (CLAUDE_CONFIG_DIR, ~/.git/config credential helper from the disinto-dev-box rollout) has not been applied on harb yet. The clean upgrade flow cannot reach harb from where it currently is — there is too much drift.

Each existing client box needs a one-time manual reset to a known-good baseline before the versioned upgrade flow takes over. The reset is mechanical but not automatable — it touches host-level state that pre-dates the new flow.

Shape: a documented runbook at docs/client-box-bootstrap.md (or similar) that walks operators through the one-time reset:

  1. disinto down
  2. git fetch --all && git checkout <latest tag> on the working tree
  3. Apply host-level migrations:
    • setup_claude_config_dir true (from lib/claude-config.sh, added in #641)
    • Strip embedded creds from .git/config's forgejo remote and add the inline credential helper using the pattern from #669
    • Rotate FORGE_PASS and FORGE_TOKEN if they have leaked (separate decision)
  4. Rebuild images (docker compose build) or pull from registry once this sprint lands
  5. disinto up
  6. Verify with disinto status and a smoke fetch through the credential helper

After the reset, the box is in a known-good baseline and disinto upgrade <version> takes over for all subsequent upgrades. The runbook documents this as the only manual operation an operator should ever have to perform on a client box.

Files touched: new docs/client-box-bootstrap.md. Optionally a small change to disinto init to detect "this looks like a stale-state box that needs the reset runbook, not a fresh init" and refuse with a pointer to the runbook.

Dependency: none (can be done in parallel with this sprint and the others).
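The optional stale-box guard in disinto init could be as small as the check below. The heuristic — an existing .env that predates the AGENTS_IMAGE pin — is an assumption; the real signal would be chosen when the runbook lands.

```shell
# Hypothetical stale-box detection for disinto init: refuse to run a fresh
# init on a box that looks like it needs the one-time reset runbook instead.
_refuse_if_stale_box() {
  if [ -f .env ] && ! grep -q '^AGENTS_IMAGE=' .env; then
    echo "this box predates versioned images; follow docs/client-box-bootstrap.md" >&2
    return 1
  fi
  return 0
}
```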

Updated recommendation

The original recommendation stands: this sprint is worth it, ~80% gluecode, GHCR over Docker Hub. Layered on top:

  • Sequence the four follow-ups: A (upgrade subcommand) and D (bootstrap runbook) are independent of this sprint's image work and can land in parallel. B (version unification) is small cleanup that depends on this sprint. C (migration framework) can wait until the first migration that actually needs it — setup_claude_config_dir doesn't, since it already lives in disinto init.
  • Do not fix #665 in parallel: as noted in "Side effects", this sprint deletes the cause. A depends_on: service_healthy workaround applied to edge in parallel would be wasted work.
  • Do not file separate forge issues for the follow-ups until this sprint is broken into sub-issues: keep them in this document until the architect (or the operator) is ready to commit to a sequence. That avoids backlog clutter and lets the four follow-ups stay reorderable as the sprint shape evolves.