disinto-ops/sprints/versioned-agent-images.md

2026-04-09 08:31:46 +00:00
# Sprint: versioned agent images
## Vision issues
- #429 — feat: publish versioned agent images — compose should use image: not build:
## What this enables
After this sprint, `disinto init` produces a `docker-compose.yml` that pulls a pinned image
from a registry instead of building from source. A new factory instance needs only a token
and a config file — no clone, no build, no local Docker context. This closes the gap between
"works on my machine" and "one-command bootstrap."
It also enables rollback: if agents misbehave after an upgrade, `AGENTS_IMAGE=v0.1.1 disinto up`
restores the previous version without touching the codebase.
## What exists today
The release pipeline is more complete than it looks:
- `formulas/release.toml` — 7-step release formula. Steps 4-5 already build and tag the image
locally (`docker compose build --no-cache agents`, `docker tag disinto-agents disinto-agents:$RELEASE_VERSION`).
The gap: no push step, no registry target.
- `lib/release.sh` — Creates vault TOML and ops repo PR for the release. No image version wired
into compose generation.
- `lib/generators.sh` `_generate_compose_impl()` — Generates compose with `build: context: .
dockerfile: docker/agents/Dockerfile` for agents, runner, reproduce, edge. Version-unaware.
- `vault/vault-env.sh` — `DOCKER_HUB_TOKEN` is in `VAULT_ALLOWED_SECRETS`. Not currently used.
- `docker/agents/Dockerfile` — No VOLUME declarations; runtime state, repos, and config are
mounted via compose but not declared. Claude binary injected by compose at init time.
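Since the Dockerfile declares nothing today, a sketch of matched declarations follows. The `VOLUME` paths come from this document; the named-volume names (`agent-data`, etc.) and the generated compose layout are placeholders, not the generator's actual output:

```
# docker/agents/Dockerfile — declare exactly the paths that compose mounts
VOLUME ["/home/agent/data", "/home/agent/repos", \
        "/home/agent/disinto/projects", "/home/agent/disinto/state"]

# docker-compose.yml (generated) — each declared path needs a matching
# named mount, or Docker silently creates an anonymous volume for it:
#   agents:
#     volumes:
#       - agent-data:/home/agent/data
#       - agent-repos:/home/agent/repos
#       - agent-projects:/home/agent/disinto/projects
#       - agent-state:/home/agent/disinto/state
```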
## Complexity
Files touched: 4
- `formulas/release.toml` — add `push-image` step (after tag-image, before restart-agents)
- `lib/generators.sh` — `_generate_compose_impl()` reads `AGENTS_IMAGE` env var; emits
`image:` when set, falls back to `build:` when not set (dev mode)
- `docker/agents/Dockerfile` — add explicit VOLUME declarations for /home/agent/data,
/home/agent/repos, /home/agent/disinto/projects, /home/agent/disinto/state
- `bin/disinto` — `disinto_up()` passes `AGENTS_IMAGE` through to compose if set in `.env`
Subsystems: release formula, compose generation, Dockerfile hygiene
Sub-issues: 3
Gluecode ratio: ~80% gluecode (release step, VOLUME declarations), ~20% new (AGENTS_IMAGE env var path)
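The `image:`/`build:` toggle in `_generate_compose_impl` can be sketched as below. The helper name `_emit_agents_service` and the exact YAML the real generator emits are assumptions:

```
# Hypothetical helper inside lib/generators.sh: emits the agents service
# body depending on whether AGENTS_IMAGE is set (pinned) or unset (dev mode).
_emit_agents_service() {
  if [ -n "${AGENTS_IMAGE:-}" ]; then
    # pinned deployment: pull the published image at the requested tag
    printf '    image: disinto-agents:%s\n' "${AGENTS_IMAGE}"
  else
    # dev mode: fall back to building from the local checkout
    printf '    build:\n      context: .\n      dockerfile: docker/agents/Dockerfile\n'
  fi
}
```

The same branch would be reused for the `runner` service, which shares the agents image.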
## Risks
- Registry credentials: `DOCKER_HUB_TOKEN` is in vault allowlist but not wired up. The push step
needs a registry login — either Docker Hub (DOCKER_HUB_TOKEN) or GHCR (GITHUB_TOKEN, already
in vault). The sprint spec must pick one and add the credential to the release vault TOML.
- Volume shadow: if VOLUME declarations don't match the compose volume mounts exactly, runtime
files land in anonymous volumes instead of named ones. Must test before shipping.
- Existing deployments: currently on `build:`. Migration: set AGENTS_IMAGE in .env, re-run
`disinto init` (compose is regenerated), restart. No SSH, no worktree needed.
- `runner` service: same image as agents, same version. Must update runner service in compose gen too.
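The registry-credential risk shrinks to three commands once a route is picked. A sketch of the `push-image` step body assuming the GHCR route — `GITHUB_USER` and the `ghcr.io/OWNER/...` path are placeholders, and the real step would live in `formulas/release.toml`:

```
# Sketch of the push-image step (runs after tag-image, before restart-agents).
# GITHUB_TOKEN is already in the vault allowlist; OWNER and GITHUB_USER are
# placeholders for the real org and release user.
push_agents_image() {
  printf '%s' "${GITHUB_TOKEN}" \
    | docker login ghcr.io -u "${GITHUB_USER}" --password-stdin || return 1
  docker tag "disinto-agents:${RELEASE_VERSION}" \
    "ghcr.io/OWNER/disinto-agents:${RELEASE_VERSION}" || return 1
  docker push "ghcr.io/OWNER/disinto-agents:${RELEASE_VERSION}"
}
```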
## Cost — new infra to maintain
- Registry account + token rotation: one vault secret (DOCKER_HUB_TOKEN) needs rotation policy.
GHCR (via GITHUB_TOKEN) has no additional account but ties release to GitHub.
- Release formula grows from 7 to 8 steps. Small maintenance surface.
- `AGENTS_IMAGE` becomes a documented env var in .env for pinned deployments. Needs docs.
## Recommendation
Worth it. The release formula is 90% done — one push step closes the gap. The compose
generation change is purely additive (AGENTS_IMAGE env var, fallback to build: for dev).
Volume declarations are hygiene that should exist regardless of versioning.
Pick GHCR over Docker Hub: GITHUB_TOKEN is already in the vault allowlist and ops repo.
No new account needed.
## Side effects of this sprint
Beyond versioned images, this sprint indirectly closes one open bug:
- **#665 (edge cold-start race)** — `disinto-edge` currently exits with code 128 on a cold
`disinto up` because its entrypoint clones from `forgejo:3000` before forgejo's HTTP
listener is up. Once edge's image embeds the disinto source at build time (no runtime
clone), the race vanishes. The `depends_on: { forgejo: { condition: service_healthy } }`
workaround proposed in #665 becomes unnecessary.
Worth flagging explicitly so a dev bot working on #665 doesn't apply that workaround in
parallel — it would be churn this sprint deletes anyway.
## What this sprint does not yet enable
This sprint delivers versioned images and pinned compose. It is a foundation, not the
whole client-box upgrade story. Four follow-up sprints complete the picture for harb-style
client boxes — each independently scopable, with the dependency chain noted.
### Follow-up A: `disinto upgrade <version>` subcommand
**Why**: even with versioned images, an operator on a client box still has to coordinate
multiple steps to upgrade — `git fetch && git checkout`, edit `.env` to set
`AGENTS_IMAGE`, re-run `_generate_compose_impl`, `docker compose pull`,
`docker compose up -d --force-recreate`, plus any out-of-band migrations. There is no
single atomic command. Without one, "upgrade harb to v0.3.0" stays a multi-step human
operation that drifts out of sync.
**Shape**:
```
disinto upgrade v0.3.0
```
Sequence (roughly):
1. `git fetch --tags` and verify the tag exists
2. Bail if the working tree is dirty
3. `git checkout v0.3.0`
4. `_env_set_idempotent AGENTS_IMAGE v0.3.0 .env` (helper from #641)
5. Re-run `_generate_compose_impl` (picks up the new image tag)
6. Run pre-upgrade migration hooks (Follow-up C)
7. `docker compose pull && docker compose up -d --force-recreate`
8. Run post-upgrade migration hooks
9. Health check; rollback to previous version on failure
10. Log result
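The sequence above can be sketched as a single function. `_env_set_idempotent` and `_generate_compose_impl` exist per this document; everything else (error handling, hook points) is an assumption:

```
# Sketch of disinto_upgrade() for bin/disinto. Migration hooks, health check,
# and rollback are left as comments since they belong to Follow-up C.
disinto_upgrade() {
  local target="$1"
  git fetch --tags || return 1
  git rev-parse -q --verify "refs/tags/${target}" >/dev/null \
    || { echo "no such tag: ${target}" >&2; return 1; }
  [ -z "$(git status --porcelain)" ] \
    || { echo "working tree dirty, refusing to upgrade" >&2; return 1; }
  git checkout "${target}" || return 1
  _env_set_idempotent AGENTS_IMAGE "${target}" .env
  _generate_compose_impl
  # pre-upgrade migration hooks would run here (Follow-up C)
  docker compose pull && docker compose up -d --force-recreate || return 1
  # post-upgrade hooks, health check, and rollback-on-failure go here
}
```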
**Files touched**: `bin/disinto` (~150 lines, new `disinto_upgrade()` function), possibly
extracted to a new `lib/upgrade.sh` if it grows large enough to warrant separation.
**Dependency**: this sprint (needs `AGENTS_IMAGE` to be a real thing in `.env` and in the
compose generator).
### Follow-up B: unify `DISINTO_VERSION` and `AGENTS_IMAGE`
**Why**: today there are two version concepts in the codebase:
- `DISINTO_VERSION` — used at `docker/edge/entrypoint-edge.sh:84` for the in-container
source clone (`git clone --branch ${DISINTO_VERSION:-main}`). Defaults to `main`. Also
set in the compose generator at `lib/generators.sh:397` for the edge service.
- `AGENTS_IMAGE` — proposed by this sprint for the docker image tag in compose.
These should be **the same value**. If you are running the `v0.3.0` agents image, the
in-container source (if any clone still happens) should also be at `v0.3.0`. Otherwise
you get a v0.3.0 binary running against v-something-else source, which is exactly the
silent drift versioning is meant to prevent.
After this sprint folds source into the image, `DISINTO_VERSION` in containers becomes
vestigial. The follow-up: pick one name (probably keep `DISINTO_VERSION` since it is
referenced in more places), have `_generate_compose_impl` set both `image:` and the env
var from the same source, and delete the redundant runtime clone block in
`entrypoint-edge.sh`.
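The "set both from the same source" idea is small. A sketch of the relevant generator fragment — the YAML layout is illustrative only, and a `main`-tagged image would need to be published for the dev default to be pullable:

```
# One variable drives both the image tag and the in-container env var,
# so the binary and any remaining source reference cannot diverge.
_emit_agents_version() {
  version="${DISINTO_VERSION:-main}"
  printf '    image: disinto-agents:%s\n' "${version}"
  printf '    environment:\n'
  printf '      - DISINTO_VERSION=%s\n' "${version}"
}
```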
**Files touched**: `lib/generators.sh`, `docker/edge/entrypoint-edge.sh` (delete the
runtime clone block once the image carries source), possibly `lib/env.sh` for the
default value.
**Dependency**: this sprint.
### Follow-up C: migration framework for breaking changes
**Why**: some upgrades have side effects beyond "new code in the container":
- The CLAUDE_CONFIG_DIR migration (#641 — `setup_claude_config_dir` in
`lib/claude-config.sh`) needs a one-time `mkdir + mv + symlink` per host.
- The credential-helper cleanup (#669; #671 for the safety-net repair) needs in-volume
URL repair.
- Future: schema changes in the vault, ops repo restructures, env var renames.
There is no `disinto/migrations/v0.3.0.sh` style framework. Existing migrations live
ad-hoc inside `disinto init` and run unconditionally on init. That works for fresh
installs but not for "I'm upgrading from v0.2.0 to v0.3.0 and need migrations
v0.2.1 → v0.2.2 → v0.3.0 to run in order".
**Shape**: a `migrations/` directory with one file per version (`v0.3.0.sh`,
`v0.3.1.sh`, …). `disinto upgrade` (Follow-up A) invokes each migration file in order
between the previous applied version and the target. Track the applied version in
`.env` (e.g. `DISINTO_LAST_MIGRATION=v0.3.0`) or in `state/`. Standard
rails/django/flyway pattern. The framework itself is small; the value is in having a
place for migrations to live so they are not scattered through `disinto init` and lost
in code review.
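The runner itself fits in a few lines. A sketch, assuming the `migrations/vX.Y.Z.sh` layout described above, GNU `sort -V` for version ordering, and the `_env_set_idempotent` helper from #641:

```
# True if $1 sorts strictly before $2 in version order (relies on GNU sort -V).
_ver_lt() {
  [ "$1" != "$2" ] && \
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}

# Apply every migration strictly after $1 (last applied), up to and
# including $2 (target), in version order, tracking progress in .env.
run_migrations() {
  last="$1" target="$2"
  for f in $(ls migrations/v*.sh 2>/dev/null | sort -V); do
    v=$(basename "$f" .sh)
    if _ver_lt "$last" "$v" && ! _ver_lt "$target" "$v"; then
      sh "$f" || return 1
      _env_set_idempotent DISINTO_LAST_MIGRATION "$v" .env
    fi
  done
}
```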
**Files touched**: `lib/upgrade.sh` (the upgrade command is the natural caller), new
`migrations/` directory, a tracking key in `.env` for the last applied migration
version.
**Dependency**: Follow-up A (the upgrade command is the natural caller).
### Follow-up D: bootstrap-from-broken-state runbook
**Why**: this sprint and Follow-ups A–C describe the steady-state upgrade flow. But
existing client boxes — harb-dev-box specifically — are not in steady state. harb's
working tree is at tag `v0.2.0` (months behind main). Its containers are running locally
built `:latest` images of unknown vintage. Some host-level state (`CLAUDE_CONFIG_DIR`,
`~/.git/config` credential helper from the disinto-dev-box rollout) has not been applied
on harb yet. The clean upgrade flow cannot reach harb from where it currently is — there
is too much drift.
Each existing client box needs a **one-time manual reset** to a known-good baseline
before the versioned upgrade flow takes over. The reset is mechanical but not
automatable — it touches host-level state that pre-dates the new flow.
**Shape**: a documented runbook at `docs/client-box-bootstrap.md` (or similar) that
walks operators through the one-time reset:
1. `disinto down`
2. `git fetch --all && git checkout <latest tag>` on the working tree
3. Apply host-level migrations:
- `setup_claude_config_dir true` (from `lib/claude-config.sh`, added in #641)
- Strip embedded creds from `.git/config`'s forgejo remote and add the inline
credential helper using the pattern from #669
- Rotate `FORGE_PASS` and `FORGE_TOKEN` if they have leaked (separate decision)
4. Rebuild images (`docker compose build`) or pull from registry once this sprint lands
5. `disinto up`
6. Verify with `disinto status` and a smoke fetch through the credential helper
After the reset, the box is in a known-good baseline and `disinto upgrade <version>`
takes over for all subsequent upgrades. The runbook documents this as the only manual
operation an operator should ever have to perform on a client box.
**Files touched**: new `docs/client-box-bootstrap.md`. Optionally a small change to
`disinto init` to detect "this looks like a stale-state box that needs the reset
runbook, not a fresh init" and refuse with a pointer to the runbook.
**Dependency**: none (can be done in parallel with this sprint and the others).
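The optional `disinto init` guard could be as small as comparing the checked-out tag to the newest tag. Both the heuristic and the function name are assumptions, not existing code:

```
# Hypothetical guard for disinto init: refuse to run on a box whose checkout
# is pinned to an old tag, pointing the operator at the bootstrap runbook.
_refuse_if_stale() {
  current=$(git describe --tags --exact-match 2>/dev/null) || return 0  # not on a tag: dev checkout
  latest=$(git tag --list 'v*' | sort -V | tail -n1)
  if [ -n "$latest" ] && [ "$current" != "$latest" ]; then
    echo "checkout is at ${current}, latest tag is ${latest}:" >&2
    echo "this looks like a stale client box; see docs/client-box-bootstrap.md" >&2
    return 1
  fi
}
```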
## Updated recommendation
The original recommendation stands: this sprint is worth it, ~80% gluecode, GHCR over
Docker Hub. Layered on top:
- **Sequence the four follow-ups**: A (upgrade subcommand) and D (bootstrap runbook) are
independent of this sprint's image work and can land in parallel. B (version
unification) is small cleanup that depends on this sprint. C (migration framework) can
wait until the first migration that actually needs it — `setup_claude_config_dir`
doesn't, since it already lives in `disinto init`.
- **Do not fix #665 in parallel**: as noted in "Side effects", this sprint deletes the
cause. A `depends_on: service_healthy` workaround applied to edge in parallel would be
wasted work.
- **Do not file separate forge issues for the follow-ups until this sprint is broken into
sub-issues**: keep them in this document until the architect (or the operator) is ready
to commit to a sequence. That avoids backlog clutter and lets the four follow-ups stay
reorderable as the sprint shape evolves.