sprint(versioned-agent-images): add side-effects, follow-up sprints, updated recommendation #15
1 changed files with 164 additions and 0 deletions

@@ -67,3 +67,167 @@ Volume declarations are hygiene that should exist regardless of versioning.

Pick GHCR over Docker Hub: GITHUB_TOKEN is already in the vault allowlist and ops repo.
No new account needed.

## Side effects of this sprint

Beyond versioned images, this sprint indirectly closes one open bug:

- **#665 (edge cold-start race)**: `disinto-edge` currently exits with code 128 on a cold
  `disinto up` because its entrypoint clones from `forgejo:3000` before forgejo's HTTP
  listener is up. Once edge's image embeds the disinto source at build time (no runtime
  clone), the race vanishes. The `depends_on: { forgejo: { condition: service_healthy } }`
  workaround proposed in #665 becomes unnecessary.

Worth flagging explicitly so a dev bot working on #665 doesn't apply that workaround in
parallel: it would be churn this sprint deletes anyway.

## What this sprint does not yet enable

This sprint delivers versioned images and pinned compose. It is a foundation, not the
whole client-box upgrade story. Four follow-up sprints complete the picture for harb-style
client boxes. Each can be scoped independently, and each lists its dependency.

### Follow-up A: `disinto upgrade <version>` subcommand

**Why**: even with versioned images, an operator on a client box still has to coordinate
multiple steps to upgrade: `git fetch && git checkout`, edit `.env` to set
`AGENTS_IMAGE`, re-run `_generate_compose_impl`, `docker compose pull`,
`docker compose up -d --force-recreate`, plus any out-of-band migrations. There is no
single atomic command. Without one, "upgrade harb to v0.3.0" stays a multi-step human
operation in which the checkout, `.env`, and the running containers can drift out of sync.

**Shape**:

```
disinto upgrade v0.3.0
```

Sequence (roughly):

1. `git fetch --tags` and verify the tag exists
2. Bail if the working tree is dirty
3. `git checkout v0.3.0`
4. `_env_set_idempotent AGENTS_IMAGE v0.3.0 .env` (helper from #641)
5. Re-run `_generate_compose_impl` (picks up the new image tag)
6. Run pre-upgrade migration hooks (Follow-up C)
7. `docker compose pull && docker compose up -d --force-recreate`
8. Run post-upgrade migration hooks
9. Health check; roll back to the previous version on failure
10. Log result
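
The sequence above could be sketched roughly as follows. This is an illustration of the step order only, not the planned implementation: `run`, `_tree_is_clean`, `_run_migration_hooks`, `_health_check`, and `_rollback_to_previous` are hypothetical names introduced here, and `_env_set_idempotent` and `_generate_compose_impl` are assumed to be callable exactly as written in the list. A `DRY_RUN=1` mode prints each step instead of executing it.

```sh
# Sketch only. Helpers marked "hypothetical" do not exist in the codebase.

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: succeeds only when there are no uncommitted changes.
_tree_is_clean() { [ -z "$(git status --porcelain)" ]; }

disinto_upgrade() {
  local target="$1" env_file="${2:-.env}"
  [ -n "$target" ] || { echo "usage: disinto upgrade <version>" >&2; return 2; }

  run git fetch --tags                                        || return 1  # step 1
  run git rev-parse -q --verify "refs/tags/${target}"         || return 1  # step 1: tag exists
  run _tree_is_clean                                          || return 1  # step 2
  run git checkout "$target"                                  || return 1  # step 3
  run _env_set_idempotent AGENTS_IMAGE "$target" "$env_file"  || return 1  # step 4
  run _generate_compose_impl                                  || return 1  # step 5
  run _run_migration_hooks pre "$target"                      || return 1  # step 6 (hypothetical)
  run docker compose pull                                     || return 1  # step 7
  run docker compose up -d --force-recreate                   || return 1  # step 7
  run _run_migration_hooks post "$target"                     || return 1  # step 8 (hypothetical)
  if ! run _health_check; then                                             # step 9 (hypothetical)
    echo "health check failed; rolling back" >&2
    run _rollback_to_previous                                              # hypothetical
    return 1
  fi
  echo "upgrade to ${target} complete"                                     # step 10
}
```

Running `DRY_RUN=1 disinto_upgrade v0.3.0` prints the planned commands in order, which may be a cheap way to review the sequence before wiring in real helpers.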

**Files touched**: `bin/disinto` (~150 lines, new `disinto_upgrade()` function), possibly
extracted to a new `lib/upgrade.sh` if it grows large enough to warrant separation.

**Dependency**: this sprint (needs `AGENTS_IMAGE` to exist in `.env` and in the
compose generator).

### Follow-up B: unify `DISINTO_VERSION` and `AGENTS_IMAGE`

**Why**: once this sprint lands, there will be two version concepts in the codebase:

- `DISINTO_VERSION`: used at `docker/edge/entrypoint-edge.sh:84` for the in-container
  source clone (`git clone --branch ${DISINTO_VERSION:-main}`). Defaults to `main`. Also
  set in the compose generator at `lib/generators.sh:397` for the edge service.
- `AGENTS_IMAGE`: proposed by this sprint for the Docker image tag in compose.

These should be **the same value**. If you are running the `v0.3.0` agents image, the
in-container source (if any clone still happens) should also be at `v0.3.0`. Otherwise
you get a v0.3.0 binary running against v-something-else source, which is exactly the
silent drift versioning is meant to prevent.

After this sprint folds source into the image, `DISINTO_VERSION` in containers becomes
vestigial. The follow-up: pick one name (probably keep `DISINTO_VERSION` since it is
referenced in more places), have `_generate_compose_impl` set both `image:` and the env
var from the same source, and delete the redundant runtime clone block in
`entrypoint-edge.sh`.
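
As a minimal sketch of "set both from the same source", a generator helper could take one version value and emit it in both places. The function name `_emit_edge_service` and the registry path `ghcr.io/example/disinto-edge` are placeholders invented for this illustration; the real stanza in `lib/generators.sh` will look different.

```sh
# Sketch only: function name and image path are hypothetical.
# Emits an edge service stanza whose image tag and env var share one value.
_emit_edge_service() {
  local version="$1"
  [ -n "$version" ] || { echo "version required" >&2; return 1; }
  cat <<EOF
  edge:
    image: ghcr.io/example/disinto-edge:${version}
    environment:
      - DISINTO_VERSION=${version}
EOF
}
```

Because both lines are rendered from the single `version` argument, the image tag and the in-container variable cannot disagree.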

**Files touched**: `lib/generators.sh`, `docker/edge/entrypoint-edge.sh` (delete the
runtime clone block once the image carries source), possibly `lib/env.sh` for the
default value.

**Dependency**: this sprint.

### Follow-up C: migration framework for breaking changes

**Why**: some upgrades have side effects beyond "new code in the container":

- The `CLAUDE_CONFIG_DIR` migration (#641 → `setup_claude_config_dir` in
  `lib/claude-config.sh`) needs a one-time `mkdir + mv + symlink` per host.
- The credential-helper cleanup (#669; #671 for the safety-net repair) needs in-volume
  URL repair.
- Future: schema changes in the vault, ops repo restructures, env var renames.

There is no `disinto/migrations/v0.3.0.sh` style framework. Existing migrations live
ad hoc inside `disinto init` and run unconditionally on init. That works for fresh
installs but not for "I'm upgrading from v0.2.0 to v0.3.0 and need migrations
v0.2.1 → v0.2.2 → v0.3.0 to run in order".

**Shape**: a `migrations/` directory with one file per version (`v0.3.0.sh`,
`v0.3.1.sh`, …). `disinto upgrade` (Follow-up A) invokes each migration file in version
order, starting after the last applied version and ending with the target. Track the
applied version in `.env` (e.g. `DISINTO_LAST_MIGRATION=v0.3.0`) or in `state/`. This is
the standard Rails/Django/Flyway pattern. The framework itself is small; the value is in
having a place for migrations to live so they are not scattered through `disinto init`
and lost in code review.
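
The selection rule (everything after the last applied version, up to and including the target) could be sketched like this. All three function names are hypothetical, the sketch assumes a `sort` that supports `-V` (GNU coreutils and BusyBox do), that migration paths contain no whitespace, and that migrations can be run with `sh`; the real scripts may well need `bash`.

```sh
# Sketch only: function names are hypothetical.

# version_gt A B: true when A sorts strictly after B in version order.
version_gt() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# pending_migrations DIR LAST TARGET:
# print migration files with LAST < version <= TARGET, oldest first.
pending_migrations() {
  local dir="$1" last="$2" target="$3" f v
  printf '%s\n' "$dir"/*.sh | sort -V | while IFS= read -r f; do
    [ -e "$f" ] || continue          # unmatched glob stays literal; skip it
    v="$(basename "$f" .sh)"
    if version_gt "$v" "$last" && ! version_gt "$v" "$target"; then
      printf '%s\n' "$f"
    fi
  done
}

# run_migrations DIR LAST TARGET:
# run each pending migration in order, stop at the first failure,
# and print the last version that was applied successfully.
run_migrations() {
  local dir="$1" last="$2" target="$3" m
  for m in $(pending_migrations "$dir" "$last" "$target"); do
    sh "$m" || { echo "migration failed: $m" >&2; return 1; }
    last="$(basename "$m" .sh)"
  done
  echo "$last"   # the caller would persist this, e.g. as DISINTO_LAST_MIGRATION
}
```

Version sorting matters here because a plain glob orders `v0.10.0.sh` before `v0.9.0.sh`.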

**Files touched**: `lib/upgrade.sh` (invokes the migrations), new `migrations/`
directory, a tracking key for the last applied migration version (in `.env` or
`state/`).

**Dependency**: Follow-up A (the upgrade command is the natural caller).

### Follow-up D: bootstrap-from-broken-state runbook

**Why**: this sprint and Follow-ups A–C describe the steady-state upgrade flow. But
existing client boxes (harb-dev-box specifically) are not in steady state. harb's
working tree is at tag `v0.2.0` (months behind main). Its containers are running locally
built `:latest` images of unknown vintage. Some host-level state (`CLAUDE_CONFIG_DIR`,
`~/.git/config` credential helper from the disinto-dev-box rollout) has not been applied
on harb yet. The clean upgrade flow cannot reach harb from where it currently is: there
is too much drift.

Each existing client box needs a **one-time manual reset** to a known-good baseline
before the versioned upgrade flow takes over. The reset is mechanical but not
automatable: it touches host-level state that pre-dates the new flow.

**Shape**: a documented runbook at `docs/client-box-bootstrap.md` (or similar) that
walks operators through the one-time reset:

1. `disinto down`
2. `git fetch --all && git checkout <latest tag>` on the working tree
3. Apply host-level migrations:
   - `setup_claude_config_dir true` (from `lib/claude-config.sh`, added in #641)
   - Strip embedded creds from `.git/config`'s forgejo remote and add the inline
     credential helper using the pattern from #669
   - Rotate `FORGE_PASS` and `FORGE_TOKEN` if they have leaked (separate decision)
4. Rebuild images (`docker compose build`) or pull from registry once this sprint lands
5. `disinto up`
6. Verify with `disinto status` and a smoke fetch through the credential helper
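
For the "strip embedded creds" part of step 3, a small sketch of the URL rewrite follows. It covers only the stripping; the credential helper from #669 is not reproduced here. The function names are hypothetical, and the remote name `origin` is an assumption about how the forgejo remote is named on the box.

```sh
# Sketch only: function names are hypothetical.

# strip_url_creds URL: drop any "user:pass@" part from an http(s) URL.
strip_url_creds() {
  printf '%s\n' "$1" | sed -E 's#^(https?://)[^/]*@#\1#'
}

# repair_remote [REMOTE]: rewrite the remote's URL without embedded creds.
# The default remote name is an assumption.
repair_remote() {
  local remote="${1:-origin}" url
  url="$(git remote get-url "$remote")" || return 1
  git remote set-url "$remote" "$(strip_url_creds "$url")"
}
```

For example, `strip_url_creds 'http://bot:s3cret@forgejo:3000/org/repo.git'` yields `http://forgejo:3000/org/repo.git`, and a URL without credentials passes through unchanged.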

After the reset, the box is at a known-good baseline and `disinto upgrade <version>`
takes over for all subsequent upgrades. The runbook documents this as the only manual
operation an operator should ever have to perform on a client box.

**Files touched**: new `docs/client-box-bootstrap.md`. Optionally a small change to
`disinto init` to detect "this looks like a stale-state box that needs the reset
runbook, not a fresh init" and refuse with a pointer to the runbook.

**Dependency**: none (can be done in parallel with this sprint and the others).

## Updated recommendation

The original recommendation stands: this sprint is worth it, ~80% glue code, GHCR over
Docker Hub. Layered on top:

- **Sequence the four follow-ups**: D (bootstrap runbook) has no dependency and can land
  in parallel with this sprint. A (upgrade subcommand) can be developed alongside the
  image work but cannot ship until `AGENTS_IMAGE` exists in `.env` and the compose
  generator. B (version unification) is small cleanup that depends on this sprint. C
  (migration framework) depends on A and can wait until the first migration that
  actually needs it; `setup_claude_config_dir` doesn't, since it already lives in
  `disinto init`.
- **Do not fix #665 in parallel**: as noted in "Side effects", this sprint deletes the
  cause. A `depends_on: service_healthy` workaround applied to edge in parallel would be
  wasted work.
- **Do not file separate forge issues for the follow-ups until this sprint is broken into
  sub-issues**: keep them in this document until the architect (or the operator) is ready
  to commit to a sequence. That avoids backlog clutter and lets the four follow-ups stay
  reorderable as the sprint shape evolves.