docs: [nomad-step-6] S6 — cutover runbook from docker-compose to Nomad+Vault #1037

Open
opened 2026-04-19 08:42:26 +00:00 by dev-bot · 0 comments
Collaborator

Context

Factory migration from docker-compose (disinto-dev-box @ 10.10.10.67) to Nomad+Vault (disinto-nomad-box @ 10.10.10.216).

Steps 0–5 verified via ./bin/disinto init --backend=nomad --import-env /tmp/.env --with edge on a wiped LXC. Step 6 is docs-only: the cutover runbook from live docker-compose to live Nomad.

Scope

Produce a runbook covering:

1. Pre-cutover readiness checklist

  • All 8 jobs (forgejo, woodpecker-server, woodpecker-agent, agents, staging, chat, edge, vault-runner) healthy on a fresh wipe+init
  • Known-open follow-ups closed (see Blockers below)
  • deploy.sh health-check timeout for chat raised above the current 240s (cold start observed at ~200–270s; #1036 proposes 360s)
  • Vault unseal key backed up off-box
  • Forgejo repos list + sizes captured from dev-box (sanity-check migration target)
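
The first checklist item can be checked mechanically. The following is a minimal sketch, not part of the disinto repo: `unhealthy_jobs` is a hypothetical helper that reads `<job-id> <status>` pairs on stdin (that input format is an assumption) and reports any of the 8 expected jobs that are missing or not running.

```shell
#!/bin/sh
# Hypothetical helper: read "<job-id> <status>" pairs on stdin and print
# every expected job that is missing or not in the "running" state.
# The job names are taken from the readiness checklist.
EXPECTED_JOBS="forgejo woodpecker-server woodpecker-agent agents staging chat edge vault-runner"

unhealthy_jobs() {
    input=$(cat)
    for job in $EXPECTED_JOBS; do
        # Exact match on the first column, so "agents" never matches "woodpecker-agent".
        status=$(printf '%s\n' "$input" | awk -v j="$job" '$1 == j { print $2; exit }')
        if [ "$status" != "running" ]; then
            printf '%s %s\n' "$job" "${status:-missing}"
        fi
    done
}
```

One possible way to feed it, assuming the usual `nomad job status` column layout (ID, Type, Priority, Status), is `nomad job status | awk 'NR > 1 { print $1, $4 }' | unhealthy_jobs`. Empty output means every expected job is running. Note that "running" is weaker than "healthy": service health checks still need to be confirmed separately.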

2. Data migration matrix

Map each docker-compose named volume → Nomad host volume under /srv/disinto/*:

| docker volume | Nomad host volume | Method |
|---------------|-------------------|--------|
| disinto_forgejo-data | /srv/disinto/forgejo | docker cp + rsync, then chown to forgejo uid |
| disinto_woodpecker-data | /srv/disinto/woodpecker-server | rsync |
| disinto_caddy_data | /srv/disinto/caddy-data | rsync (certs) |
| disinto_chat-config | /srv/disinto/chat | rsync |
| disinto_project-repos* | /srv/disinto/agents-repos | rsync (3 variants → merge) |
| disinto_agent-data* | /srv/disinto/agents-data | rsync |
| disinto_llama-data | N/A (llama-only, on host) | skip |
| disinto_disinto-logs | /srv/disinto/logs | rsync |
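
The plain-rsync rows of the matrix can be turned into a reviewable command list before anything is copied. This is a sketch under stated assumptions: `migration_plan` and `VOLUME_ROOT` are hypothetical names, the volumes are assumed to use Docker's default local driver (data under `<volume>/_data`), and the forgejo row and the two glob rows are deliberately left out because they need the docker cp, chown, and merge handling described in the matrix.

```shell
#!/bin/sh
# Print (do not run) one rsync command per plain-rsync row of the matrix.
# Assumption: Docker's default "local" volume driver, so each volume's
# data lives under $VOLUME_ROOT/<name>/_data.
VOLUME_ROOT="${VOLUME_ROOT:-/var/lib/docker/volumes}"

migration_plan() {
    while read -r volume target; do
        # -a preserves owners, modes and timestamps; --numeric-ids keeps
        # uid/gid numbers as-is instead of mapping them by name.
        # The trailing slashes copy the directory contents, not the directory.
        printf 'rsync -a --numeric-ids %s/%s/_data/ %s/\n' "$VOLUME_ROOT" "$volume" "$target"
    done <<'EOF'
disinto_woodpecker-data /srv/disinto/woodpecker-server
disinto_caddy_data /srv/disinto/caddy-data
disinto_chat-config /srv/disinto/chat
disinto_disinto-logs /srv/disinto/logs
EOF
}
```

Running `migration_plan` prints four commands for review. Printing rather than executing keeps the sketch safe to run anywhere. If the target turns out to be a separate box, the destination would need a `host:` prefix, which this sketch does not add.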

Open questions:

  • Target host: reuse disinto-dev-box (wipe + switch) vs. new box? Impacts data copy path (local vs. over-network).
  • Forgejo version parity between docker image and Nomad job — confirm gitea.com/forgejo/forgejo:X.Y tag matches.

3. Secrets migration

  • .env on dev-box already imported via tools/vault-import.sh during step 2 verification.
  • OAuth secrets (woodpecker ↔ forgejo, chat ↔ forgejo) get regenerated on fresh init. Either:
    • (a) Keep new secrets, re-register OAuth apps in Forgejo post-cutover (simpler, preferred)
    • (b) Copy pre-existing secrets from dev-box → vault (preserves existing tokens)

Recommend (a) unless user sessions must survive cutover.

4. Cutover sequence (production box)

  1. Announce maintenance window (edge will 502 for ~10 min)
  2. On dev-box: docker-compose stop (stops the containers but leaves the named volumes in place; do not run docker-compose down -v, which deletes them)
  3. Snapshot volumes: tar -czf /srv/backup-pre-cutover-$(date +%Y%m%d).tar.gz /var/lib/docker/volumes/disinto_*
  4. Copy data into /srv/disinto/* (see matrix above)
  5. Install Nomad + Vault (lib/init/nomad/install.sh, systemd.sh)
  6. ./bin/disinto init --backend=nomad --import-env /root/.env --with edge (imports .env, seeds vault, deploys all jobs)
  7. Post-deploy: re-register OAuth apps, verify Woodpecker login, verify chat OAuth flow
  8. Update DO edge Caddy to point self.disinto.ai → new port (or keep port identical if reusing dev-box)
  9. Update autossh reverse-tunnel unit on new box (copy from dev-box's /etc/systemd/system/reverse-tunnel.service)
  10. Smoke test: curl https://self.disinto.ai/ (forgejo), CI pipeline on a test PR, chat login
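
Because rollback depends on the snapshot from step 3, it may be worth confirming the archive is readable before step 4 touches any data. A minimal sketch, where `snapshot_and_verify` is a hypothetical helper and not an existing disinto script:

```shell
#!/bin/sh
# Hypothetical helper: archive the given paths, then confirm the archive
# can be read back in full before any data is copied or modified.
# Usage: snapshot_and_verify <archive.tar.gz> <path> [path...]
snapshot_and_verify() {
    archive=$1
    shift
    tar -czf "$archive" "$@" || return 1
    # Listing makes tar read and decompress every entry, so a truncated
    # or corrupt archive fails here instead of during a rollback.
    tar -tzf "$archive" > /dev/null || return 1
    printf 'snapshot ok: %s\n' "$archive"
}
```

With the paths from step 3 this would be called as `snapshot_and_verify "/srv/backup-pre-cutover-$(date +%Y%m%d).tar.gz" /var/lib/docker/volumes/disinto_*`. The listing pass only proves the archive is readable; it does not compare contents against the source volumes.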

5. Rollback plan

If Nomad stack fails to come up healthy within 30 min:

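  1. Stop the running Nomad jobs first (for example nomad job stop -purge <job> for each job), then nomad system gc && systemctl stop nomad vault. On its own, nomad system gc only garbage-collects jobs and allocations that are already dead.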
  2. docker-compose up -d on dev-box (volumes unchanged)
  3. Revert the DNS / tunnel changes made in cutover steps 8–9 (section 4)
  4. File post-mortem issue with logs from /opt/nomad/data/alloc/*/alloc/logs/

Rollback is safe because cutover step 3 (section 4) snapshots the volumes before anything writes to them.
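
The 30-minute trigger can be enforced with a simple polling loop instead of watching the clock. The sketch below assumes a single health command is an adequate proxy for "stack healthy"; `wait_healthy` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: poll a health command until it succeeds or a
# deadline passes.
# Usage: wait_healthy <timeout-seconds> <interval-seconds> <command> [args...]
# Returns 0 as soon as the command succeeds, 1 once the deadline is reached.
wait_healthy() {
    timeout=$1
    interval=$2
    shift 2
    deadline=$(( $(date +%s) + timeout ))
    while :; do
        if "$@" > /dev/null 2>&1; then
            return 0
        fi
        if [ "$(date +%s)" -ge "$deadline" ]; then
            return 1
        fi
        sleep "$interval"
    done
}
```

For example, `wait_healthy 1800 15 curl -fsS https://self.disinto.ai/ || echo "deadline passed, start rollback"` polls the Forgejo front page every 15 seconds for up to 30 minutes. A fuller check would also cover the CI and chat smoke tests from the cutover sequence.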

6. Post-cutover cleanup

  • After 1 week of stable Nomad operation: delete docker-compose files, docker system prune -a, remove docker-compose from systemd if unit exists
  • Remove disinto_* named volumes
  • Update project_nomad_migration.md memory: mark step 6 done, remove "cutover pending" lines
  • Close the umbrella vision issue #981 (step 5 parent)

Blockers (must close before executing this runbook)

  • #1034 — edge caddy task clones Forgejo from 127.0.0.1:3000 — unreachable from bridge network
  • #1035 — dispatcher task Missing: vault.read(kv/data/disinto/bots/vault) template error on fresh init
  • #1036 — chat deploy health-check timeout needs extension from 240s → 360s in lib/init/nomad/deploy.sh

Acceptance

  • Runbook reviewed and merged as docs/nomad-cutover-runbook.md
  • Dry-run on a wiped LXC: execute sections 2–4 against a snapshot copy of dev-box volumes, verify all services healthy
  • Rollback drill: deliberately fail one service, confirm rollback procedure restores docker-compose in <10 min
dev-bot added the vision label 2026-04-19 08:42:26 +00:00
Reference: disinto-admin/disinto#1037