docs: nomad-cutover-runbook.md — end-to-end cutover procedure #1060

Closed
opened 2026-04-19 20:06:22 +00:00 by dev-bot · 0 comments
Collaborator

Goal: write docs/nomad-cutover-runbook.md, the end-to-end procedure to cut over the factory from docker-compose on disinto-dev-box to Nomad on disinto-nomad-box.

Prerequisites (must be done before running the runbook):

  • Companion tools landed: disinto backup create (#1057), disinto backup import (#1058)
  • disinto-nomad-box Nomad + Vault stack healthy on a fresh init test (step 5 verified)
  • Codeberg push mirror for disinto-admin/disinto confirmed in sync (done 2026-04-19)

Cutover decisions already locked (from parent #1037 scoping):

  • Target: disinto-nomad-box (10.10.10.216) becomes production; keep disinto-dev-box warm for rollback
  • Downtime: <5min blue-green flip
  • Secrets: regenerate OAuth on fresh init (sessions invalidated)
  • Data scope: issues + disinto-ops bundle only; everything else regenerated or discarded

Runbook sections (deliverable as docs/nomad-cutover-runbook.md)

1. Pre-cutover readiness checklist

  • Nomad stack healthy on fresh wipe+init
  • Codeberg mirror current (git log parity between dev-box Forgejo and Codeberg)
  • SSH key pair generated for the new box and registered on the DO edge (operator step; see section 4, step 6)
  • Backup tarball produced (#1057) and tested against scratch LXC (#1058)
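The mirror-parity item above could be checked by comparing the branch heads that both remotes advertise. This is a minimal sketch, not part of the disinto tooling: `same_refs` is a hypothetical helper name, and the two URLs in the usage comment are placeholders for the real dev-box Forgejo and Codeberg endpoints.

```shell
# Hypothetical helper: succeed only if two `git ls-remote` listings are
# non-empty and contain the same lines, ignoring order.
same_refs() {
  [ -n "$1" ] && [ "$(printf '%s\n' "$1" | sort)" = "$(printf '%s\n' "$2" | sort)" ]
}

# Usage (placeholder URLs, adjust to the real remotes):
# forgejo=$(git ls-remote --heads https://forgejo.example/disinto-admin/disinto.git)
# codeberg=$(git ls-remote --heads https://codeberg.example/disinto-admin/disinto.git)
# same_refs "$forgejo" "$codeberg" && echo "mirror in sync"
```

An empty listing is treated as a failure so that an unreachable remote never reads as "in sync".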

2. Pre-cutover artifact: backup

./bin/disinto backup create /tmp/disinto-backup-$(date +%Y%m%d).tar.gz
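Before relying on the tarball it may be worth confirming that it is a readable archive and recording a checksum. A small sketch, assuming the backup is a plain gzip-compressed tar as the file extension suggests; `tarball_ok` is a hypothetical helper, not a disinto subcommand.

```shell
# Hypothetical helper: succeed if the file exists, is non-empty, and lists
# at least one entry as a gzip-compressed tar archive.
tarball_ok() {
  [ -s "$1" ] && [ "$(tar -tzf "$1" 2>/dev/null | wc -l)" -gt 0 ]
}

# Usage:
# f=/tmp/disinto-backup-$(date +%Y%m%d).tar.gz
# tarball_ok "$f" && sha256sum "$f" > "$f.sha256"
```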

3. Pre-cutover dry-run

On a throwaway LXC (lxc launch ubuntu:24.04 cutover-dryrun):

disinto init --backend=nomad --import-env .env --with edge
./bin/disinto backup import /tmp/disinto-backup-*.tar.gz

Verify issue count + ops-repo refs match source.
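One way to compare issue counts is to read the `X-Total-Count` response header that the Forgejo issue-list endpoint returns. The sketch below only parses that header; `total_count` is a hypothetical helper, and `SRC`, `DST`, and `TOKEN` in the usage comment are assumed placeholders for the two base URLs and an API token.

```shell
# Hypothetical helper: read raw HTTP response headers on stdin and print the
# value of X-Total-Count (header name matched case-insensitively).
total_count() {
  tr -d '\r' | awk -F': *' 'tolower($1) == "x-total-count" { print $2; exit }'
}

# Usage (SRC, DST, TOKEN are placeholders):
# count() {
#   curl -fsS -o /dev/null -D - -H "Authorization: token $TOKEN" \
#     "$1/api/v1/repos/disinto-admin/disinto/issues?state=all&type=issues&limit=1" | total_count
# }
# [ "$(count "$SRC")" = "$(count "$DST")" ] && echo "issue counts match"
```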

4. Cutover T-0 (operator executes; <5min target)

  1. docker-compose stop on dev-box (preserves volumes for rollback)
  2. disinto init --backend=nomad --import-env .env --with edge on nomad-box (if not already provisioned)
  3. ./bin/disinto backup import /tmp/disinto-backup-*.tar.gz on nomad-box
  4. Configure Codeberg → new Forgejo pull mirror (manual, one-time)
  5. claude login on nomad-box to set up Anthropic OAuth
  6. Autossh tunnel swap (operator step — cross-host, no dev-agent involvement):
    • systemctl stop reverse-tunnel on dev-box
    • Copy /etc/systemd/system/reverse-tunnel.service to nomad-box, or regenerate via init
    • On DO edge box: echo <nomad-box-pubkey> >> /home/johba/.ssh/authorized_keys with same restricted-command as the dev-box key
    • systemctl enable --now reverse-tunnel on nomad-box
    • Verify: curl https://self.disinto.ai/api/v1/version returns new box's version
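The final verification can take a few seconds to turn green after the tunnel comes up, so a bounded retry loop may help the operator stay inside the 5-minute window. This sketch leaves the tunnel swap itself manual and only polls the endpoint; `wait_for_output` is a hypothetical helper and `EXPECTED_VERSION` is an assumed variable holding the version string the nomad-box Forgejo reports.

```shell
# Hypothetical helper: rerun a command until its output contains the expected
# text, or give up after a fixed number of attempts.
# $1 = expected substring, $2 = max attempts, $3 = delay in seconds, rest = command
wait_for_output() {
  want=$1; tries=$2; delay=$3; shift 3
  i=0
  while [ "$i" -lt "$tries" ]; do
    if "$@" 2>/dev/null | grep -qF -- "$want"; then return 0; fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage (30 attempts x 10 s covers the 5-minute target):
# wait_for_output "$EXPECTED_VERSION" 30 10 curl -fsS https://self.disinto.ai/api/v1/version
```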

5. Post-cutover smoke

  • curl https://self.disinto.ai → Forgejo welcome
  • Create a test PR → Woodpecker pipeline runs → agents assign and work
  • Claude chat login via Forgejo OAuth

6. Rollback (if any section 4 gate fails)

  1. systemctl stop reverse-tunnel on nomad-box
  2. systemctl start reverse-tunnel on dev-box
  3. docker-compose up -d on dev-box
  4. DO Caddy unchanged — traffic restored in <5min
  5. File post-mortem; keep nomad-box state for debugging
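When rehearsing the rollback, the elapsed time can be checked against the recovery target with a few lines of shell. A minimal sketch; `within_budget` is a hypothetical helper and the 300-second budget in the usage comment simply mirrors the <5min figure above.

```shell
# Hypothetical helper: succeed if the time between two epoch timestamps
# does not exceed the budget.
# $1 = start (epoch seconds), $2 = end (epoch seconds), $3 = budget in seconds
within_budget() {
  [ $(( $2 - $1 )) -le "$3" ]
}

# Usage during a rehearsal:
# start=$(date +%s)
# ... run rollback steps 1-3 by hand ...
# within_budget "$start" "$(date +%s)" 300 && echo "rollback within 5 min"
```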

7. Post-stable cleanup (T+1 week)

  • Archive /var/lib/docker/volumes/disinto_* to cold storage
  • docker-compose down -v on dev-box (only after the archive is verified; -v deletes the named volumes)
  • Delete disinto-dev-box LXC or keep as permanent rollback reserve (operator decision)
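The volume archive has to be taken before `docker-compose down -v`, because `-v` removes the named volumes. A sketch of the archive step under stated assumptions: `archive_volumes` is a hypothetical helper, the destination path in the usage comment is a placeholder, and volume directory names are assumed to contain no whitespace.

```shell
# Hypothetical helper: archive every entry under a parent directory whose name
# starts with a prefix into one gzip-compressed tarball.
# $1 = parent directory, $2 = name prefix, $3 = output tarball
archive_volumes() {
  parent=$1; prefix=$2; out=$3
  # List matches relative to the parent so the archive holds short paths.
  names=$(cd "$parent" && ls -d "$prefix"* 2>/dev/null) || return 1
  [ -n "$names" ] || return 1
  # Unquoted on purpose: one argument per volume directory name.
  tar -czf "$out" -C "$parent" $names
}

# Usage (run as root on dev-box with the containers stopped; destination is a placeholder):
# archive_volumes /var/lib/docker/volumes disinto_ /mnt/cold/disinto-volumes.tar.gz
```

Listing the result with `tar -tzf` before removing the volumes gives a cheap confirmation that every expected volume made it into the archive.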

File location

docs/nomad-cutover-runbook.md

Acceptance

  • Runbook committed + reviewed (PR against main)
  • Dry-run walkthrough on scratch LXC: sections 2–3 executable end-to-end, produces a restored factory
  • Rollback section: rehearsed on scratch, restores docker-compose state in under 10 min

Scope hint for implementer

This is a docs-only PR. No code changes — just write the markdown per the structure above. Reference #1057 and #1058 by number where the runbook invokes those tools. Do NOT try to automate the operator steps (section 4.6) — those are intentionally manual.

Tracking: parent #1037 closes when this runbook is merged AND has been used successfully for a real cutover.

dev-bot added the backlog and bug-report labels 2026-04-19 20:06:22 +00:00
dev-bot changed title from "docs" to "docs: nomad-cutover-runbook.md — end-to-end cutover procedure" 2026-04-19 20:06:31 +00:00
dev-bot removed the bug-report label 2026-04-19 20:06:38 +00:00
dev-bot self-assigned this 2026-04-19 20:49:31 +00:00
dev-bot added in-progress and removed backlog labels 2026-04-19 20:49:32 +00:00
dev-bot removed their assignment 2026-04-19 21:01:04 +00:00
Reference: disinto-admin/disinto#1060