disinto-ops/sprints/versioned-agent-images.md
dev-bot 3a172bcc86 sprint(versioned-agent-images): add side-effects, four follow-up sprints, updated recommendation
Enriches the architect's existing sprint plan with:

1. Side effects: this sprint indirectly closes #665 (edge cold-start race) by
   removing the runtime clone — flagging so a parallel #665 fix isn't applied.

2. Four follow-up sprints that complete the client-box upgrade story:
   - A: 'disinto upgrade <version>' subcommand for atomic client-side upgrades
   - B: unify DISINTO_VERSION and AGENTS_IMAGE into one version concept
   - C: migration framework for breaking changes (per-version migration files)
   - D: bootstrap-from-broken-state runbook for existing drifted boxes (harb)

3. Updated recommendation that sequences the follow-ups against this sprint
   and notes #665 should not be fixed in parallel.

The original sprint scope (4 files, ~80% gluecode, GHCR) is unchanged and
remains tightly scoped. The follow-ups are deliberately kept inside this
document rather than filed as separate forge issues until the sprint plan is
ready to be broken into sub-issues by the architect.
2026-04-11 10:09:57 +00:00


# Sprint: versioned agent images
## Vision issues
- #429 — feat: publish versioned agent images — compose should use image: not build:
## What this enables
After this sprint, `disinto init` produces a `docker-compose.yml` that pulls a pinned image
from a registry instead of building from source. A new factory instance needs only a token
and a config file — no clone, no build, no local Docker context. This closes the gap between
"works on my machine" and "one-command bootstrap."
It also enables rollback: if agents misbehave after an upgrade, `AGENTS_IMAGE=v0.1.1 disinto up`
restores the previous version without touching the codebase.
## What exists today
The release pipeline is more complete than it looks:
- `formulas/release.toml` — 7-step release formula. Steps 4-5 already build and tag the image
locally (`docker compose build --no-cache agents`, `docker tag disinto-agents disinto-agents:$RELEASE_VERSION`).
The gap: no push step, no registry target.
- `lib/release.sh` — Creates vault TOML and ops repo PR for the release. No image version wired
into compose generation.
- `lib/generators.sh` `_generate_compose_impl()` — Generates compose with `build: context: .
dockerfile: docker/agents/Dockerfile` for agents, runner, reproduce, edge. Version-unaware.
- `vault/vault-env.sh` — `DOCKER_HUB_TOKEN` is in `VAULT_ALLOWED_SECRETS`. Not currently used.
- `docker/agents/Dockerfile` — No VOLUME declarations; runtime state, repos, and config are
mounted via compose but not declared. Claude binary injected by compose at init time.
## Complexity
Files touched: 4
- `formulas/release.toml` — add `push-image` step (after tag-image, before restart-agents)
- `lib/generators.sh` — `_generate_compose_impl()` reads `AGENTS_IMAGE` env var; emits
`image:` when set, falls back to `build:` when not set (dev mode)
- `docker/agents/Dockerfile` — add explicit VOLUME declarations for /home/agent/data,
/home/agent/repos, /home/agent/disinto/projects, /home/agent/disinto/state
- `bin/disinto` `disinto_up()` — pass `AGENTS_IMAGE` through to compose if set in `.env`
Subsystems: release formula, compose generation, Dockerfile hygiene
Sub-issues: 3
Gluecode ratio: ~80% gluecode (release step, VOLUME declarations), ~20% new (AGENTS_IMAGE env var path)
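The `AGENTS_IMAGE` switch in `_generate_compose_impl` can be sketched as follows. This is a hypothetical fragment, assuming POSIX shell and a GHCR image path (`ghcr.io/disinto/disinto-agents`) that the sprint has not yet pinned down; only the image-vs-build switch itself comes from the plan above.

```shell
# Hypothetical sketch of the image/build switch inside _generate_compose_impl.
# The registry path is an assumption; the fallback mirrors today's behaviour.
emit_agents_service() {
  if [ -n "${AGENTS_IMAGE:-}" ]; then
    # Pinned deployment: pull the published image at the requested tag.
    printf '  agents:\n    image: ghcr.io/disinto/disinto-agents:%s\n' "$AGENTS_IMAGE"
  else
    # Dev mode: no AGENTS_IMAGE in .env, build from the local checkout.
    printf '  agents:\n    build:\n      context: .\n      dockerfile: docker/agents/Dockerfile\n'
  fi
}
```

The same switch would apply to the `runner` service, which shares the image.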
## Risks
- Registry credentials: `DOCKER_HUB_TOKEN` is in vault allowlist but not wired up. The push step
needs a registry login — either Docker Hub (DOCKER_HUB_TOKEN) or GHCR (GITHUB_TOKEN, already
in vault). The sprint spec must pick one and add the credential to the release vault TOML.
- Volume shadow: if VOLUME declarations don't match the compose volume mounts exactly, runtime
files land in anonymous volumes instead of named ones. Must test before shipping.
- Existing deployments: currently on `build:`. Migration: set AGENTS_IMAGE in .env, re-run
`disinto init` (compose is regenerated), restart. No SSH, no worktree needed.
- `runner` service: same image as agents, same version. Must update runner service in compose gen too.
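To keep the Dockerfile and compose aligned, the VOLUME additions might look like this sketch. The paths are the four listed under Complexity; matching them exactly against the compose mounts is what the volume-shadow test must verify.

```dockerfile
# Sketch only: these must match the compose volume mounts exactly, or runtime
# files land in anonymous volumes (the shadow risk above).
VOLUME /home/agent/data
VOLUME /home/agent/repos
VOLUME /home/agent/disinto/projects
VOLUME /home/agent/disinto/state
```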
## Cost — new infra to maintain
- Registry account + token rotation: one vault secret (DOCKER_HUB_TOKEN) needs rotation policy.
GHCR (via GITHUB_TOKEN) has no additional account but ties release to GitHub.
- Release formula grows from 7 to 8 steps. Small maintenance surface.
- `AGENTS_IMAGE` becomes a documented env var in .env for pinned deployments. Needs docs.
## Recommendation
Worth it. The release formula is 90% done — one push step closes the gap. The compose
generation change is purely additive (AGENTS_IMAGE env var, fallback to build: for dev).
Volume declarations are hygiene that should exist regardless of versioning.
Pick GHCR over Docker Hub: GITHUB_TOKEN is already in the vault allowlist and ops repo.
No new account needed.
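Under the GHCR choice, the new `push-image` step could run something like this dry-run sketch. The `run` wrapper only echoes each command so the sequence can be read end to end; the image path and `GHCR_USER` are assumptions, and a real step would pipe `GITHUB_TOKEN` into the login.

```shell
# Dry-run sketch of a GHCR push-image release step. 'run' echoes instead of
# executing. RELEASE_VERSION would come from the existing tag-image step.
run() { echo "+ $*"; }

push_image() {
  version="$1"
  run docker login ghcr.io -u "${GHCR_USER:-disinto}" --password-stdin
  run docker tag "disinto-agents:$version" "ghcr.io/disinto/disinto-agents:$version"
  run docker push "ghcr.io/disinto/disinto-agents:$version"
}
```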
## Side effects of this sprint
Beyond versioned images, this sprint indirectly closes one open bug:
- **#665 (edge cold-start race)** — `disinto-edge` currently exits with code 128 on a cold
`disinto up` because its entrypoint clones from `forgejo:3000` before forgejo's HTTP
listener is up. Once edge's image embeds the disinto source at build time (no runtime
clone), the race vanishes. The `depends_on: { forgejo: { condition: service_healthy } }`
workaround proposed in #665 becomes unnecessary.
Worth flagging explicitly so a dev bot working on #665 doesn't apply that workaround in
parallel — it would be churn this sprint deletes anyway.
## What this sprint does not yet enable
This sprint delivers versioned images and pinned compose. It is a foundation, not the
whole client-box upgrade story. Four follow-up sprints complete the picture for harb-style
client boxes — each independently scopable, with the dependency chain noted.
### Follow-up A: `disinto upgrade <version>` subcommand
**Why**: even with versioned images, an operator on a client box still has to coordinate
multiple steps to upgrade — `git fetch && git checkout`, edit `.env` to set
`AGENTS_IMAGE`, re-run `_generate_compose_impl`, `docker compose pull`,
`docker compose up -d --force-recreate`, plus any out-of-band migrations. There is no
single atomic command. Without one, "upgrade harb to v0.3.0" stays a multi-step human
operation that drifts out of sync.
**Shape**:
```
disinto upgrade v0.3.0
```
Sequence (roughly):
1. `git fetch --tags` and verify the tag exists
2. Bail if the working tree is dirty
3. `git checkout v0.3.0`
4. `_env_set_idempotent AGENTS_IMAGE v0.3.0 .env` (helper from #641)
5. Re-run `_generate_compose_impl` (picks up the new image tag)
6. Run pre-upgrade migration hooks (Follow-up C)
7. `docker compose pull && docker compose up -d --force-recreate`
8. Run post-upgrade migration hooks
9. Health check; rollback to previous version on failure
10. Log result
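The ten steps above can be read end to end as a dry-run sketch. The `run` wrapper only echoes each command; the helper names `_env_set_idempotent` and `_generate_compose_impl` come from this document, while the migration-hook and health-check commands are placeholders.

```shell
# Dry-run sketch of the proposed disinto_upgrade(). 'run' echoes rather than
# executes, so this shows the sequence, not a working implementation.
run() { echo "+ $*"; }

disinto_upgrade() {
  target="$1"
  run git fetch --tags                                  # 1. tag must exist
  run git diff --quiet                                  # 2. bail on dirty tree
  run git checkout "$target"                            # 3. move working tree
  run _env_set_idempotent AGENTS_IMAGE "$target" .env   # 4. pin image (helper from #641)
  run _generate_compose_impl                            # 5. regenerate compose
  run migrations pre "$target"                          # 6. pre-upgrade hooks (placeholder)
  run docker compose pull                               # 7. pull new image...
  run docker compose up -d --force-recreate             #    ...and recreate
  run migrations post "$target"                         # 8. post-upgrade hooks (placeholder)
  run disinto status                                    # 9. health check; rollback on failure
  echo "upgraded to $target"                            # 10. log result
}
```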
**Files touched**: `bin/disinto` (~150 lines, new `disinto_upgrade()` function), possibly
extracted to a new `lib/upgrade.sh` if it grows large enough to warrant separation.
**Dependency**: this sprint (needs `AGENTS_IMAGE` to be a real thing in `.env` and in the
compose generator).
### Follow-up B: unify `DISINTO_VERSION` and `AGENTS_IMAGE`
**Why**: today there are two version concepts in the codebase:
- `DISINTO_VERSION` — used at `docker/edge/entrypoint-edge.sh:84` for the in-container
source clone (`git clone --branch ${DISINTO_VERSION:-main}`). Defaults to `main`. Also
set in the compose generator at `lib/generators.sh:397` for the edge service.
- `AGENTS_IMAGE` — proposed by this sprint for the docker image tag in compose.
These should be **the same value**. If you are running the `v0.3.0` agents image, the
in-container source (if any clone still happens) should also be at `v0.3.0`. Otherwise
you get a v0.3.0 binary running against v-something-else source, which is exactly the
silent drift versioning is meant to prevent.
After this sprint folds source into the image, `DISINTO_VERSION` in containers becomes
vestigial. The follow-up: pick one name (probably keep `DISINTO_VERSION` since it is
referenced in more places), have `_generate_compose_impl` set both `image:` and the env
var from the same source, and delete the redundant runtime clone block in
`entrypoint-edge.sh`.
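A minimal sketch of that wiring, assuming the generator derives both fields from one `DISINTO_VERSION` value (the edge image name is a placeholder):

```shell
# Sketch: one version value feeds both the image tag and the in-container env
# var, so binary and source can never drift apart. Image name is an assumption.
emit_edge_version_block() {
  version="${DISINTO_VERSION:-main}"
  cat <<EOF
  edge:
    image: ghcr.io/disinto/disinto-edge:${version}
    environment:
      DISINTO_VERSION: ${version}
EOF
}
```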
**Files touched**: `lib/generators.sh`, `docker/edge/entrypoint-edge.sh` (delete the
runtime clone block once the image carries source), possibly `lib/env.sh` for the
default value.
**Dependency**: this sprint.
### Follow-up C: migration framework for breaking changes
**Why**: some upgrades have side effects beyond "new code in the container":
- The CLAUDE_CONFIG_DIR migration (#641 → `setup_claude_config_dir` in
`lib/claude-config.sh`) needs a one-time `mkdir + mv + symlink` per host.
- The credential-helper cleanup (#669; #671 for the safety-net repair) needs in-volume
URL repair.
- Future: schema changes in the vault, ops repo restructures, env var renames.
There is no `disinto/migrations/v0.3.0.sh` style framework. Existing migrations live
ad-hoc inside `disinto init` and run unconditionally on init. That works for fresh
installs but not for "I'm upgrading from v0.2.0 to v0.3.0 and need migrations
v0.2.1 → v0.2.2 → v0.3.0 to run in order".
**Shape**: a `migrations/` directory with one file per version (`v0.3.0.sh`,
`v0.3.1.sh`, …). `disinto upgrade` (Follow-up A) invokes each migration file in order
between the previous applied version and the target. Track the applied version in
`.env` (e.g. `DISINTO_LAST_MIGRATION=v0.3.0`) or in `state/`. Standard
rails/django/flyway pattern. The framework itself is small; the value is in having a
place for migrations to live so they are not scattered through `disinto init` and lost
in code review.
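The runner itself can be small. A sketch, assuming one `migrations/vX.Y.Z.sh` file per version and version-aware ordering via `sort -V` (all names are assumptions drawn from the shape above):

```shell
# Sketch of a per-version migration runner: apply every migration file
# strictly after 'last' and up to and including 'target', in version order.
run_migrations() {
  dir="$1" last="$2" target="$3"
  for f in $(ls "$dir"/v*.sh 2>/dev/null | sort -V); do
    v="$(basename "$f" .sh)"
    # v is after 'last' iff 'last' sorts first and they differ;
    # v is at or before 'target' iff v sorts first (or they are equal).
    if [ "$(printf '%s\n%s\n' "$last" "$v" | sort -V | head -n1)" = "$last" ] &&
       [ "$v" != "$last" ] &&
       [ "$(printf '%s\n%s\n' "$v" "$target" | sort -V | head -n1)" = "$v" ]; then
      sh "$f"   # a real runner would record the applied version after each step
    fi
  done
}
```

For example, `run_migrations migrations v0.2.0 v0.3.0` would run `v0.2.1.sh`, `v0.2.2.sh`, and `v0.3.0.sh` in order, skipping anything at or before the last applied version.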
**Files touched**: `lib/upgrade.sh` (the upgrade command is the natural caller), new
`migrations/` directory, a tracking key in `.env` for the last applied migration
version.
**Dependency**: Follow-up A (the upgrade command is the natural caller).
### Follow-up D: bootstrap-from-broken-state runbook
**Why**: this sprint and Follow-ups A-C describe the steady-state upgrade flow. But
existing client boxes — harb-dev-box specifically — are not in steady state. harb's
working tree is at tag `v0.2.0` (months behind main). Its containers are running locally
built `:latest` images of unknown vintage. Some host-level state (`CLAUDE_CONFIG_DIR`,
`~/.git/config` credential helper from the disinto-dev-box rollout) has not been applied
on harb yet. The clean upgrade flow cannot reach harb from where it currently is — there
is too much drift.
Each existing client box needs a **one-time manual reset** to a known-good baseline
before the versioned upgrade flow takes over. The reset is mechanical but not
automatable — it touches host-level state that pre-dates the new flow.
**Shape**: a documented runbook at `docs/client-box-bootstrap.md` (or similar) that
walks operators through the one-time reset:
1. `disinto down`
2. `git fetch --all && git checkout <latest tag>` on the working tree
3. Apply host-level migrations:
- `setup_claude_config_dir true` (from `lib/claude-config.sh`, added in #641)
- Strip embedded creds from `.git/config`'s forgejo remote and add the inline
credential helper using the pattern from #669
- Rotate `FORGE_PASS` and `FORGE_TOKEN` if they have leaked (separate decision)
4. Rebuild images (`docker compose build`) or pull from registry once this sprint lands
5. `disinto up`
6. Verify with `disinto status` and a smoke fetch through the credential helper
After the reset, the box is in a known-good baseline and `disinto upgrade <version>`
takes over for all subsequent upgrades. The runbook documents this as the only manual
operation an operator should ever have to perform on a client box.
**Files touched**: new `docs/client-box-bootstrap.md`. Optionally a small change to
`disinto init` to detect "this looks like a stale-state box that needs the reset
runbook, not a fresh init" and refuse with a pointer to the runbook.
**Dependency**: none (can be done in parallel with this sprint and the others).
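The optional stale-state detection in `disinto init` could be as small as a heuristic guard. This sketch assumes "a pinned tag checked out but no `AGENTS_IMAGE` in `.env`" signals a pre-flow box; the heuristic itself is an assumption, not part of the sprint.

```shell
# Hypothetical guard: returns 0 (true) when the box looks like it pre-dates
# the versioned flow and should be sent to the bootstrap runbook instead.
needs_bootstrap_runbook() {
  env_file="$1" checked_out_ref="$2"
  case "$checked_out_ref" in
    v*) grep -q '^AGENTS_IMAGE=' "$env_file" 2>/dev/null || return 0 ;;
  esac
  return 1   # fresh init, dev checkout, or already on the versioned flow
}
```

`disinto init` would call it with the current tag (e.g. from `git describe --tags --exact-match`) and, when it returns true, refuse with a pointer to the runbook.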
## Updated recommendation
The original recommendation stands: this sprint is worth it, ~80% gluecode, GHCR over
Docker Hub. Layered on top:
- **Sequence the four follow-ups**: A (upgrade subcommand) and D (bootstrap runbook) are
independent of this sprint's image work and can land in parallel. B (version
unification) is small cleanup that depends on this sprint. C (migration framework) can
wait until the first migration that actually needs it — `setup_claude_config_dir`
doesn't, since it already lives in `disinto init`.
- **Do not fix #665 in parallel**: as noted in "Side effects", this sprint deletes the
cause. A `depends_on: service_healthy` workaround applied to edge in parallel would be
wasted work.
- **Do not file separate forge issues for the follow-ups until this sprint is broken into
sub-issues**: keep them in this document until the architect (or the operator) is ready
to commit to a sequence. That avoids backlog clutter and lets the four follow-ups stay
reorderable as the sprint shape evolves.