From 3a172bcc86f56df48fd6b7a5f9536d993b688745 Mon Sep 17 00:00:00 2001
From: dev-bot
Date: Sat, 11 Apr 2026 10:09:57 +0000
Subject: [PATCH] sprint(versioned-agent-images): add side-effects, four follow-up sprints, updated recommendation
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Enriches the architect's existing sprint plan with:

1. Side effects: this sprint indirectly closes #665 (edge cold-start race)
   by removing the runtime clone; flagged so a parallel #665 fix isn't
   applied.
2. Four follow-up sprints that complete the client-box upgrade story:
   - A: 'disinto upgrade ' subcommand for atomic client-side upgrades
   - B: unify DISINTO_VERSION and AGENTS_IMAGE into one version concept
   - C: migration framework for breaking changes (per-version migration
     files)
   - D: bootstrap-from-broken-state runbook for existing drifted boxes
     (harb)
3. Updated recommendation that sequences the follow-ups against this
   sprint and notes #665 should not be fixed in parallel.

The original sprint scope (4 files, ~80% gluecode, GHCR) is unchanged and
remains tightly scoped. The follow-ups are deliberately kept inside this
document rather than filed as separate forge issues until the sprint plan
is ready to be broken into sub-issues by the architect.
---
 sprints/versioned-agent-images.md | 164 ++++++++++++++++++++++++++++++
 1 file changed, 164 insertions(+)

diff --git a/sprints/versioned-agent-images.md b/sprints/versioned-agent-images.md
index e6d30ce..3390d35 100644
--- a/sprints/versioned-agent-images.md
+++ b/sprints/versioned-agent-images.md
@@ -67,3 +67,167 @@ Volume declarations are hygiene that should exist regardless of versioning. Pick GHCR over Docker Hub: GITHUB_TOKEN is already in the vault allowlist and ops repo. No new account needed.
+
+## Side effects of this sprint
+
+Beyond versioned images, this sprint indirectly closes one open bug:
+
+- **#665 (edge cold-start race)**: `disinto-edge` currently exits with code 128 on a cold
+  `disinto up` because its entrypoint clones from `forgejo:3000` before forgejo's HTTP
+  listener is up. Once edge's image embeds the disinto source at build time (no runtime
+  clone), the race vanishes. The `depends_on: { forgejo: { condition: service_healthy } }`
+  workaround proposed in #665 becomes unnecessary.
+
+  Worth flagging explicitly so a dev bot working on #665 doesn't apply that workaround in
+  parallel; it would be churn this sprint deletes anyway.
+
+## What this sprint does not yet enable
+
+This sprint delivers versioned images and pinned compose. It is a foundation, not the
+whole client-box upgrade story. Four follow-up sprints complete the picture for harb-style
+client boxes, each independently scopable, with the dependency chain noted.
+
+### Follow-up A: `disinto upgrade ` subcommand
+
+**Why**: even with versioned images, an operator on a client box still has to coordinate
+multiple steps to upgrade: `git fetch && git checkout`, edit `.env` to set
+`AGENTS_IMAGE`, re-run `_generate_compose_impl`, `docker compose pull`,
+`docker compose up -d --force-recreate`, plus any out-of-band migrations. There is no
+single atomic command. Without one, "upgrade harb to v0.3.0" stays a multi-step human
+operation that drifts out of sync.
+
+**Shape**:
+
+```
+disinto upgrade v0.3.0
+```
+
+Sequence (roughly):
+
+1. `git fetch --tags` and verify the tag exists
+2. Bail if the working tree is dirty
+3. `git checkout v0.3.0`
+4. `_env_set_idempotent AGENTS_IMAGE v0.3.0 .env` (helper from #641)
+5. Re-run `_generate_compose_impl` (picks up the new image tag)
+6. Run pre-upgrade migration hooks (Follow-up C)
+7. `docker compose pull && docker compose up -d --force-recreate`
+8. Run post-upgrade migration hooks
+9. Health check; roll back to the previous version on failure
+10. Log result
+
+**Files touched**: `bin/disinto` (~150 lines, new `disinto_upgrade()` function), possibly
+extracted to a new `lib/upgrade.sh` if it grows large enough to warrant separation.
+
+**Dependency**: this sprint (needs `AGENTS_IMAGE` to exist in `.env` and in the
+compose generator).
+
+### Follow-up B: unify `DISINTO_VERSION` and `AGENTS_IMAGE`
+
+**Why**: today there are two version concepts in the codebase:
+
+- `DISINTO_VERSION`: used at `docker/edge/entrypoint-edge.sh:84` for the in-container
+  source clone (`git clone --branch ${DISINTO_VERSION:-main}`). Defaults to `main`. Also
+  set in the compose generator at `lib/generators.sh:397` for the edge service.
+- `AGENTS_IMAGE`: proposed by this sprint for the docker image tag in compose.
+
+These should be **the same value**. If you are running the `v0.3.0` agents image, the
+in-container source (if any clone still happens) should also be at `v0.3.0`. Otherwise
+you get a v0.3.0 binary running against v-something-else source, which is exactly the
+silent drift versioning is meant to prevent.
+
+After this sprint folds source into the image, `DISINTO_VERSION` in containers becomes
+vestigial. The follow-up: pick one name (probably keep `DISINTO_VERSION` since it is
+referenced in more places), have `_generate_compose_impl` set both `image:` and the env
+var from the same source, and delete the redundant runtime clone block in
+`entrypoint-edge.sh`.
+
+**Files touched**: `lib/generators.sh`, `docker/edge/entrypoint-edge.sh` (delete the
+runtime clone block once the image carries source), possibly `lib/env.sh` for the
+default value.
+
+**Dependency**: this sprint.
+
+### Follow-up C: migration framework for breaking changes
+
+**Why**: some upgrades have side effects beyond "new code in the container":
+
+- The `CLAUDE_CONFIG_DIR` migration (#641 → `setup_claude_config_dir` in
+  `lib/claude-config.sh`) needs a one-time `mkdir + mv + symlink` per host.
+- The credential-helper cleanup (#669; #671 for the safety-net repair) needs in-volume
+  URL repair.
+- Future: schema changes in the vault, ops repo restructures, env var renames.
+
+There is no `disinto/migrations/v0.3.0.sh` style framework. Existing migrations live
+ad hoc inside `disinto init` and run unconditionally on init. That works for fresh
+installs but not for "I'm upgrading from v0.2.0 to v0.3.0 and need migrations
+v0.2.1 → v0.2.2 → v0.3.0 to run in order".
+
+**Shape**: a `migrations/` directory with one file per version (`v0.3.0.sh`,
+`v0.3.1.sh`, …). `disinto upgrade` (Follow-up A) invokes each migration file in order
+between the previous applied version and the target. Track the applied version in
+`.env` (e.g. `DISINTO_LAST_MIGRATION=v0.3.0`) or in `state/`. Standard
+Rails/Django/Flyway pattern. The framework itself is small; the value is in having a
+place for migrations to live so they are not scattered through `disinto init` and lost
+in code review.
+
+**Files touched**: `lib/upgrade.sh` (the upgrade command is the natural caller), new
+`migrations/` directory, a tracking key in `.env` for the last applied migration
+version.
+
+**Dependency**: Follow-up A (the upgrade command is the natural caller).
+
+### Follow-up D: bootstrap-from-broken-state runbook
+
+**Why**: this sprint and Follow-ups A–C describe the steady-state upgrade flow. But
+existing client boxes (harb-dev-box specifically) are not in steady state. harb's
+working tree is at tag `v0.2.0` (months behind main). Its containers are running locally
+built `:latest` images of unknown vintage. Some host-level state (`CLAUDE_CONFIG_DIR`,
+`~/.git/config` credential helper from the disinto-dev-box rollout) has not been applied
+on harb yet. The clean upgrade flow cannot reach harb from where it currently is: there
+is too much drift.
+
+Each existing client box needs a **one-time manual reset** to a known-good baseline
+before the versioned upgrade flow takes over. The reset is mechanical but not
+automatable, because it touches host-level state that pre-dates the new flow.
+
+**Shape**: a documented runbook at `docs/client-box-bootstrap.md` (or similar) that
+walks operators through the one-time reset:
+
+1. `disinto down`
+2. `git fetch --all && git checkout ` on the working tree
+3. Apply host-level migrations:
+   - `setup_claude_config_dir true` (from `lib/claude-config.sh`, added in #641)
+   - Strip embedded creds from `.git/config`'s forgejo remote and add the inline
+     credential helper using the pattern from #669
+   - Rotate `FORGE_PASS` and `FORGE_TOKEN` if they have leaked (separate decision)
+4. Rebuild images (`docker compose build`) or pull from registry once this sprint lands
+5. `disinto up`
+6. Verify with `disinto status` and a smoke fetch through the credential helper
+
+After the reset, the box is in a known-good baseline and `disinto upgrade `
+takes over for all subsequent upgrades. The runbook documents this as the only manual
+operation an operator should ever have to perform on a client box.
+
+**Files touched**: new `docs/client-box-bootstrap.md`. Optionally a small change to
+`disinto init` to detect "this looks like a stale-state box that needs the reset
+runbook, not a fresh init" and refuse with a pointer to the runbook.
+
+**Dependency**: none (can be done in parallel with this sprint and the others).
+
+## Updated recommendation
+
+The original recommendation stands: this sprint is worth it, ~80% gluecode, GHCR over
+Docker Hub. Layered on top:
+
+- **Sequence the four follow-ups**: D (bootstrap runbook) is independent of this
+  sprint's image work and can land in parallel. A (upgrade subcommand) and B (version
+  unification) both depend on this sprint; B is small cleanup. C (migration framework)
+  depends on A and can wait until the first migration that actually needs it;
+  `setup_claude_config_dir` doesn't, since it already lives in `disinto init`.
+- **Do not fix #665 in parallel**: as noted in "Side effects", this sprint deletes the
+  cause. A `depends_on: service_healthy` workaround applied to edge in parallel would be
+  wasted work.
+- **Do not file separate forge issues for the follow-ups until this sprint is broken into
+  sub-issues**: keep them in this document until the architect (or the operator) is ready
+  to commit to a sequence. That avoids backlog clutter and lets the four follow-ups stay
+  reorderable as the sprint shape evolves.
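
A note on Follow-up C: the ordering rule it describes (run every migration after the last applied version, up to and including the target) can be sketched in a few lines of shell. This is only an illustration under stated assumptions, not the planned implementation: the function names `pending_migrations` and `version_gt` are hypothetical, the `migrations/vX.Y.Z.sh` layout is taken from the plan above, and version ordering relies on `sort -V` (GNU coreutils version sort) being available on the client box.

```shell
# Hypothetical sketch: neither function exists in disinto today.

# version_gt A B: succeeds when A sorts strictly after B in version order.
# Assumes a `sort` that supports -V (GNU coreutils).
version_gt() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# pending_migrations DIR LAST TARGET
# Prints, oldest first, every version V with a file DIR/V.sh
# such that LAST < V <= TARGET.
pending_migrations() {
  for f in "$1"/v*.sh; do
    [ -e "$f" ] || continue          # the glob matched nothing
    basename "$f" .sh
  done | sort -V | while IFS= read -r v; do
    if version_gt "$v" "$2" && ! version_gt "$v" "$3"; then
      printf '%s\n' "$v"
    fi
  done
}
```

With files `v0.2.0.sh` through `v0.3.1.sh` present, `pending_migrations migrations v0.2.0 v0.3.0` would list `v0.2.1`, `v0.2.2`, and `v0.3.0`, one per line. An upgrade command could run each in turn and record the last one that succeeded (for example in the `DISINTO_LAST_MIGRATION` key suggested above), so an interrupted upgrade resumes from the right place.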