2026-04-11 10:20:08 +00:00
1 changed files with 164 additions and 0 deletions
--- a/sprints/versioned-agent-images.md
+++ b/sprints/versioned-agent-images.md
@@ -67,3 +67,167 @@ Volume declarations are hygiene that should exist regardless of versioning.

 Pick GHCR over Docker Hub: GITHUB_TOKEN is already in the vault allowlist and ops repo.
 No new account needed.
+
+## Side effects of this sprint
+
+Beyond versioned images, this sprint indirectly closes one open bug:
+
+- **#665 (edge cold-start race)** — `disinto-edge` currently exits with code 128 on a cold
+  `disinto up` because its entrypoint clones from `forgejo:3000` before forgejo's HTTP
+  listener is up. Once edge's image embeds the disinto source at build time (no runtime
+  clone), the race vanishes. The `depends_on: { forgejo: { condition: service_healthy } }`
+  workaround proposed in #665 becomes unnecessary.
+
+  Worth flagging explicitly so a dev bot working on #665 doesn't apply that workaround in
+  parallel — it would be churn this sprint deletes anyway.
+
+## What this sprint does not yet enable
+
+This sprint delivers versioned images and pinned compose. It is a foundation, not the
+whole client-box upgrade story. Four follow-up sprints complete the picture for harb-style
+client boxes — each independently scopable, with the dependency chain noted.
+
+### Follow-up A: `disinto upgrade <version>` subcommand
+
+**Why**: even with versioned images, an operator on a client box still has to coordinate
+multiple steps to upgrade — `git fetch && git checkout`, edit `.env` to set
+`AGENTS_IMAGE`, re-run `_generate_compose_impl`, `docker compose pull`,
+`docker compose up -d --force-recreate`, plus any out-of-band migrations. There is no
+single atomic command. Without one, "upgrade harb to v0.3.0" stays a multi-step manual
+operation, and any skipped step leaves the checkout, `.env`, and the running containers
+out of sync with each other.
+
+**Shape**:
+
+```
+disinto upgrade v0.3.0
+```
+
+Sequence (roughly):
+
+1. `git fetch --tags` and verify the tag exists
+2. Bail if the working tree is dirty
+3. `git checkout v0.3.0`
+4. `_env_set_idempotent AGENTS_IMAGE v0.3.0 .env` (helper from #641)
+5. Re-run `_generate_compose_impl` (picks up the new image tag)
+6. Run pre-upgrade migration hooks (Follow-up C)
+7. `docker compose pull && docker compose up -d --force-recreate`
+8. Run post-upgrade migration hooks
+9. Health check; roll back to the previous version on failure
+10. Log result
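
The sequence above could be sketched as a dry-run plan printer. This is a minimal sketch under stated assumptions, not the actual `disinto_upgrade()`: `plan_upgrade` is a hypothetical name, the tag check is deliberately loose, and the hook and health-check steps appear only as comment lines because their shape is not specified yet. It prints steps 1 to 9 in order so the plan can be reviewed before anything runs.

```shell
#!/usr/bin/env bash
set -eu

# plan_upgrade VERSION: print the upgrade steps in order without running them.
# Hypothetical helper; a real implementation would execute each step and stop
# at the first failure.
plan_upgrade() {
  local version="$1"
  # Loose sanity check that the argument looks like a tag such as v0.3.0.
  case "$version" in
    v[0-9]*.[0-9]*.[0-9]*) ;;
    *) echo "error: expected a tag like v0.3.0, got '$version'" >&2; return 1 ;;
  esac
  cat <<EOF
git fetch --tags
git rev-parse --verify --quiet refs/tags/$version
test -z "\$(git status --porcelain)"
git checkout $version
_env_set_idempotent AGENTS_IMAGE $version .env
_generate_compose_impl
# pre-upgrade migration hooks (Follow-up C)
docker compose pull
docker compose up -d --force-recreate
# post-upgrade migration hooks
# health check; roll back to the previous version on failure
EOF
}
```

Calling `plan_upgrade v0.3.0` prints the plan; an argument such as `main` is rejected with a non-zero status. Printing before executing keeps the order reviewable while the rollback semantics are still undecided.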
+
+**Files touched**: `bin/disinto` (~150 lines, new `disinto_upgrade()` function), possibly
+extracted to a new `lib/upgrade.sh` if it grows large enough to warrant separation.
+
+**Dependency**: this sprint (needs `AGENTS_IMAGE` to be a real thing in `.env` and in the
+compose generator).
+
+### Follow-up B: unify `DISINTO_VERSION` and `AGENTS_IMAGE`
+
+**Why**: today there are two version concepts in the codebase:
+
+- `DISINTO_VERSION` — used at `docker/edge/entrypoint-edge.sh:84` for the in-container
+  source clone (`git clone --branch ${DISINTO_VERSION:-main}`). Defaults to `main`. Also
+  set in the compose generator at `lib/generators.sh:397` for the edge service.
+- `AGENTS_IMAGE` — proposed by this sprint for the docker image tag in compose.
+
+These should be **the same value**. If you are running the `v0.3.0` agents image, the
+in-container source (if any clone still happens) should also be at `v0.3.0`. Otherwise
+you get a v0.3.0 binary running against v-something-else source, which is exactly the
+silent drift versioning is meant to prevent.
+
+After this sprint folds source into the image, `DISINTO_VERSION` in containers becomes
+vestigial. The follow-up: pick one name (probably keep `DISINTO_VERSION` since it is
+referenced in more places), have `_generate_compose_impl` set both `image:` and the env
+var from the same source, and delete the redundant runtime clone block in
+`entrypoint-edge.sh`.
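
A minimal sketch of "same source for both values", assuming the generator emits the edge service as a heredoc. `emit_edge_service`, the `edge` service layout, and the `ghcr.io/example/disinto-edge` path are placeholders for illustration, not the project's real names.

```shell
#!/usr/bin/env bash
set -eu

# emit_edge_service: print a compose fragment whose image tag and
# DISINTO_VERSION env var are both derived from one variable.
emit_edge_service() {
  # Single source of truth; falls back to "main", matching the current default.
  local version="${DISINTO_VERSION:-main}"
  # Placeholder registry path (assumption), not the real image name.
  local image_repo="ghcr.io/example/disinto-edge"
  cat <<EOF
  edge:
    image: ${image_repo}:${version}
    environment:
      - DISINTO_VERSION=${version}
EOF
}
```

Because both lines interpolate the same local variable, the image tag and the in-container version cannot diverge unless someone edits the generated compose file by hand.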
+
+**Files touched**: `lib/generators.sh`, `docker/edge/entrypoint-edge.sh` (delete the
+runtime clone block once the image carries source), possibly `lib/env.sh` for the
+default value.
+
+**Dependency**: this sprint.
+
+### Follow-up C: migration framework for breaking changes
+
+**Why**: some upgrades have side effects beyond "new code in the container":
+
+- The CLAUDE_CONFIG_DIR migration (#641 → `setup_claude_config_dir` in
+  `lib/claude-config.sh`) needs a one-time `mkdir + mv + symlink` per host.
+- The credential-helper cleanup (#669; #671 for the safety-net repair) needs in-volume
+  URL repair.
+- Future: schema changes in the vault, ops repo restructures, env var renames.
+
+There is no `disinto/migrations/v0.3.0.sh`-style framework. Existing migrations live
+ad hoc inside `disinto init` and run unconditionally on init. That works for fresh
+installs but not for "I'm upgrading from v0.2.0 to v0.3.0 and need migrations
+v0.2.1 → v0.2.2 → v0.3.0 to run in order".
+
+**Shape**: a `migrations/` directory with one file per version (`v0.3.0.sh`,
+`v0.3.1.sh`, …). `disinto upgrade` (Follow-up A) invokes each migration file in order
+between the previous applied version and the target. Track the applied version in
+`.env` (e.g. `DISINTO_LAST_MIGRATION=v0.3.0`) or in `state/`. This is the standard
+Rails/Django/Flyway pattern. The framework itself is small; the value is in having a
+place for migrations to live so they are not scattered through `disinto init` and lost
+in code review.
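
The "in order between the previous applied version and the target" rule could be sketched as below. This is a sketch under assumptions: `pending_migrations` and `version_gt` are hypothetical helpers, migration files are assumed to be named `<version>.sh`, and ordering relies on `sort -V` (version sort, available in GNU coreutils and some other `sort` implementations).

```shell
#!/usr/bin/env bash
set -eu

# version_gt A B: succeed if A is strictly newer than B under sort -V ordering.
version_gt() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# pending_migrations DIR LAST TARGET
# Print the migration files in DIR whose version v satisfies
# LAST < v <= TARGET, oldest first.
pending_migrations() {
  local dir="$1" last="$2" target="$3" f v
  for f in "$dir"/v*.sh; do
    [ -e "$f" ] || continue   # the glob matched nothing
    v="$(basename "$f" .sh)"
    if version_gt "$v" "$last" && ! version_gt "$v" "$target"; then
      printf '%s\n' "$v"
    fi
  done | sort -V | while read -r v; do
    printf '%s/%s.sh\n' "$dir" "$v"
  done
}
```

With files for v0.2.0, v0.2.1, v0.2.2, v0.3.0 and v0.10.0 present, `pending_migrations migrations v0.2.0 v0.3.0` lists the v0.2.1, v0.2.2 and v0.3.0 scripts in that order. Version sort matters here: a plain lexical sort would place v0.10.0 before v0.2.0.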
+
+**Files touched**: `lib/upgrade.sh` (the upgrade command is the natural caller), new
+`migrations/` directory, a tracking key in `.env` for the last applied migration
+version.
+
+**Dependency**: Follow-up A (the upgrade command is the natural caller).
+
+### Follow-up D: bootstrap-from-broken-state runbook
+
+**Why**: this sprint and Follow-ups A–C describe the steady-state upgrade flow. But
+existing client boxes — harb-dev-box specifically — are not in steady state. harb's
+working tree is at tag `v0.2.0` (months behind main). Its containers are running locally
+built `:latest` images of unknown vintage. Some host-level state (`CLAUDE_CONFIG_DIR`,
+`.git/config` credential helper from the disinto-dev-box rollout) has not been applied
+on harb yet. The clean upgrade flow cannot reach harb from where it currently is — there
+is too much drift.
+
+Each existing client box needs a **one-time manual reset** to a known-good baseline
+before the versioned upgrade flow takes over. The reset is mechanical but not
+automatable — it touches host-level state that pre-dates the new flow.
+
+**Shape**: a documented runbook at `docs/client-box-bootstrap.md` (or similar) that
+walks operators through the one-time reset:
+
+1. `disinto down`
+2. `git fetch --all && git checkout <latest tag>` on the working tree
+3. Apply host-level migrations:
+   - `setup_claude_config_dir true` (from `lib/claude-config.sh`, added in #641)
+   - Strip embedded creds from `.git/config`'s forgejo remote and add the inline
+     credential helper using the pattern from #669
+   - Rotate `FORGE_PASS` and `FORGE_TOKEN` if they have leaked (separate decision)
+4. Rebuild images (`docker compose build`) or pull from registry once this sprint lands
+5. `disinto up`
+6. Verify with `disinto status` and a smoke fetch through the credential helper
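
The "strip embedded creds" part of step 3 could look roughly like this. It is a sketch only: `strip_url_creds` is a hypothetical helper, it handles just `http(s)` URLs, and it does not reproduce the credential-helper pattern from #669, which is not shown in this document.

```shell
#!/usr/bin/env bash
set -eu

# strip_url_creds URL: remove a "user:password@" or "user@" prefix from the
# host part of an http(s) URL. URLs without embedded credentials pass through
# unchanged.
strip_url_creds() {
  printf '%s\n' "$1" | sed -E 's#^(https?://)[^/@]*@#\1#'
}
```

Assuming the forgejo remote is named `origin`, it could be applied as `git remote set-url origin "$(strip_url_creds "$(git remote get-url origin)")"`. The credentials stay in shell history and reflogs of the old config, which is one reason the rotation decision in step 3 is listed separately.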
+
+After the reset, the box is in a known-good baseline and `disinto upgrade <version>`
+takes over for all subsequent upgrades. The runbook documents this as the only manual
+operation an operator should ever have to perform on a client box.
+
+**Files touched**: new `docs/client-box-bootstrap.md`. Optionally a small change to
+`disinto init` to detect "this looks like a stale-state box that needs the reset
+runbook, not a fresh init" and refuse with a pointer to the runbook.
+
+**Dependency**: none (can be done in parallel with this sprint and the others).
+
+## Updated recommendation
+
+The original recommendation stands: this sprint is worth it, ~80% glue code, GHCR over
+Docker Hub. Layered on top:
+
+- **Sequence the four follow-ups**: D (bootstrap runbook) has no dependencies and can land
+  in parallel with this sprint. A (upgrade subcommand) and B (version unification) both
+  depend on this sprint; B is small cleanup. C (migration framework) depends on A and can
+  wait until the first migration that actually needs it; `setup_claude_config_dir`
+  doesn't, since it already lives in `disinto init`.
+- **Do not fix #665 in parallel**: as noted in "Side effects", this sprint deletes the
+  cause. A `depends_on: service_healthy` workaround applied to edge in parallel would be
+  wasted work.
+- **Do not file separate forge issues for the follow-ups until this sprint is broken into
+  sub-issues**: keep them in this document until the architect (or the operator) is ready
+  to commit to a sequence. That avoids backlog clutter and lets the four follow-ups stay
+  reorderable as the sprint shape evolves.