sprint: add nomad-dispatcher-cutover.md

2026-04-18 16:22:21 +00:00
2 changed files with 132 additions and 106 deletions
--- a/sprints/edge-subpath-chat.md
+++ b/sprints/edge-subpath-chat.md
@@ -1,106 +0,0 @@
-# Sprint: edge-subpath-chat
-
-## Vision issues
- #623 — vision: subpath routing + Forgejo-OAuth-gated Claude chat inside the edge container
-
-## What this enables
-After this sprint, an operator running `disinto edge register` gets a single URL — `<project>.disinto.ai` — with Forgejo at `/forge/`, Woodpecker CI at `/ci/`, a staging preview at `/staging/`, and an OAuth-gated Claude Code chat at `/chat/`, all under one wildcard cert and one bootstrap password. The factory talks back to its operator through a chat window that sits next to the forge, CI, and live preview it is driving.
-
-## What exists today
-The majority of this vision is already implemented across issues #704–#711:
-
- **Subpath routing**: Caddyfile generator produces `/forge/*`, `/ci/*`, `/staging/*`, `/chat/*` handlers (`lib/generators.sh:780–822`). Forgejo `ROOT_URL` and Woodpecker `WOODPECKER_HOST` are set to subpath values when `EDGE_TUNNEL_FQDN` is present (`bin/disinto:842–847`).
- **Chat container**: Full OAuth flow via Forgejo, HttpOnly session cookies, forward_auth defense-in-depth with `FORWARD_AUTH_SECRET`, per-user rate limiting (hourly/daily/token caps), conversation history in NDJSON (`docker/chat/server.py`).
- **Sandbox hardening**: Read-only rootfs, `cap_drop: ALL`, `no-new-privileges`, `pids_limit: 128`, `mem_limit: 512m`, no Docker socket. Verification script at `tools/edge-control/verify-chat-sandbox.sh`.
- **Edge control plane**: Tunnel registration, port allocation, Caddy admin API routing, wildcard `*.disinto.ai` cert via DNS-01 (`tools/edge-control/`).
- **Dependencies #620/#621/#622**: Admin password prompt, edge control plane, and reverse tunnel — all implemented and merged.
- **Subdomain fallback plan**: Fully documented at `docs/edge-routing-fallback.md` with pivot criteria.
-
-## Complexity
- ~6 files touched across 3 subsystems (Caddy routing, chat backend, compose generation)
- Estimated 4 sub-issues
- ~90% gluecode (wiring existing pieces), ~10% greenfield (WebSocket streaming, end-to-end smoke test)
-
-## Risks
- **Forgejo/Woodpecker subpath breakage**: Neither service is battle-tested under subpaths in this stack. Redirect loops, OAuth callback mismatches, or asset 404s are plausible. Mitigation: the fallback plan (`docs/edge-routing-fallback.md`) is already documented and estimated at under one day to pivot.
- **Cookie/CSRF collision**: Forgejo and chat share the same origin — cookie names or CSRF tokens could collide. Mitigation: chat uses a namespaced cookie (`disinto_chat_session`) and a separate OAuth app.
- **Streaming latency**: One-shot `claude --print` blocks until completion. Long responses leave the operator staring at a spinner. Not a correctness risk, but a UX risk that WebSocket streaming would fix.
-
-## Cost — new infra to maintain
- **No new services** — all containers already exist in the compose stack
- **No new scheduled tasks or formulas** — chat is a passive request handler
- **One new smoke test** (CI) — end-to-end subpath routing verification
- **Ongoing**: monitoring Forgejo/Woodpecker upstream for subpath regressions on upgrades
-
-## Recommendation
-Worth it. The vision is ~80% implemented. The remaining work is integration hardening (confirming subpath routing works end-to-end with real Forgejo/Woodpecker) and one UX improvement (WebSocket streaming). The risk is low because a documented fallback to per-service subdomains exists. Ship this sprint to close the loop on the edge experience.
-
-## Sub-issues
-
-<!-- filer:begin -->
- id: subpath-routing-smoke-test
-  title: "vision(#623): end-to-end subpath routing smoke test for Forgejo + Woodpecker + chat"
-  labels: [backlog]
-  depends_on: []
-  body: |
-    ## Goal
-    Verify that Forgejo, Woodpecker, and chat all function correctly when served
-    under /forge/, /ci/, and /chat/ subpaths on a single domain. Catch redirect
-    loops, OAuth callback failures, and asset 404s before they hit production.
-    ## Acceptance criteria
-    - [ ] Forgejo login at /forge/ completes without redirect loops
-    - [ ] Forgejo OAuth callback for Woodpecker succeeds under subpath
-    - [ ] Woodpecker dashboard loads all assets at /ci/ (no 404s on JS/CSS)
-    - [ ] Chat OAuth login flow works at /chat/login
-    - [ ] Forward_auth on /chat/* rejects unauthenticated requests with 401
-    - [ ] Staging content loads at /staging/
-    - [ ] Root / redirects to /forge/
-    - [ ] CI pipeline added to .woodpecker/ to run this test on edge-related changes
-
- id: websocket-streaming-chat
-  title: "vision(#623): WebSocket streaming for chat UI to replace one-shot claude --print"
-  labels: [backlog]
-  depends_on: [subpath-routing-smoke-test]
-  body: |
-    ## Goal
-    Replace the blocking one-shot claude --print invocation in the chat backend with
-    a WebSocket connection that streams tokens to the UI as they arrive.
-    ## Acceptance criteria
-    - [ ] /chat/ws endpoint accepts WebSocket upgrade with valid session cookie
-    - [ ] /chat/ws rejects upgrade if session cookie is missing or expired
-    - [ ] Chat backend streams claude output over WebSocket as text frames
-    - [ ] UI renders tokens incrementally as they arrive
-    - [ ] Rate limiting still enforced on WebSocket messages
-    - [ ] Caddy proxies WebSocket upgrade correctly through /chat/ws with forward_auth
-
- id: chat-working-dir-scoping
-  title: "vision(#623): scope Claude chat working directory to project staging checkout"
-  labels: [backlog]
-  depends_on: [subpath-routing-smoke-test]
-  body: |
-    ## Goal
-    Give the chat container Claude session read-write access to the project working
-    tree so the operator can inspect, explain, or modify code — scoped to that tree
-    only, with no access to factory internals, secrets, or Docker socket.
-    ## Acceptance criteria
-    - [ ] Chat container bind-mounts the project working tree as a named volume
-    - [ ] Claude invocation in server.py sets cwd to the workspace directory
-    - [ ] Claude permission mode is acceptEdits (not bypassPermissions)
-    - [ ] verify-chat-sandbox.sh updated to assert workspace mount exists
-    - [ ] Compose generator adds the workspace volume conditionally
-
- id: subpath-fallback-automation
-  title: "vision(#623): automate subdomain fallback pivot if subpath routing fails"
-  labels: [backlog]
-  depends_on: [subpath-routing-smoke-test]
-  body: |
-    ## Goal
-    If the smoke test reveals unfixable subpath issues, automate the pivot to
-    per-service subdomains so the switch is a single config change.
-    ## Acceptance criteria
-    - [ ] generators.sh _generate_caddyfile_impl accepts EDGE_ROUTING_MODE env var
-    - [ ] In subdomain mode, Caddyfile emits four host blocks per edge-routing-fallback.md
-    - [ ] register.sh registers additional subdomain routes when EDGE_ROUTING_MODE=subdomain
-    - [ ] OAuth redirect URIs in ci-setup.sh respect routing mode
-    - [ ] .env template documents EDGE_ROUTING_MODE with a comment referencing the fallback doc
-<!-- filer:end -->
--- a/sprints/nomad-dispatcher-cutover.md
+++ b/sprints/nomad-dispatcher-cutover.md
@@ -0,0 +1,132 @@
+# Sprint: nomad-dispatcher-cutover
+
+## Vision issues
+- #981 — vision: [nomad-step-5] S5 — implement dispatcher Nomad backend + retire docker-compose dispatch
+
+## What this enables
+The edge dispatcher can launch vault-runner and sidecar (reproduce/triage/verify) jobs via Nomad instead of `docker run`. This completes the Nomad migration: every workload runs under Nomad's scheduler with Vault-managed secrets, enabling proper resource limits, restart policies, and audit trails. After cutover, the docker-compose dispatch path is retired — the dispatcher no longer needs the Docker socket.
+
+## What exists today
+- **`_launch_runner_nomad()`** (dispatcher.sh:561-725) — substantially implemented: dispatches via `nomad job dispatch -detach`, polls allocation state, extracts exit code and logs. Needs validation, not greenfield.
+- **`vault-runner.hcl`** — parameterized batch jobspec with pre-templated secrets (6 runner secrets via `error_on_missing_key=false` fallback). Ready for dispatch.
+- **`edge.hcl`** — dispatcher service task with `DISPATCHER_BACKEND=nomad` and `service-dispatcher` Vault role. Deployed.
+- **Per-secret Vault policies and roles** — `runner-GITHUB_TOKEN`, `runner-CODEBERG_TOKEN`, etc. exist in `vault/policies/` and `vault/roles.yaml`.
+- **Docker backend** — `_launch_runner_docker()` and `_dispatch_sidecar_docker()` fully working as the production dispatch path.
+- **`_dispatch_sidecar_nomad()`** (dispatcher.sh:842-848) — pure stub, returns 1.
+- **No sidecar jobspec** — no Nomad equivalent of the reproduce/triage/verify containers.
+- **Mounts (ssh/gpg/sops)** — handled by the Docker backend via bind mounts; `mounts_csv` is passed to the Nomad path but ignored there.
+
+## Complexity
+- **Primary files:** dispatcher.sh (~1300 lines, 2 functions to implement/fix), 1 new jobspec (sidecar batch), vault-runner.hcl (mount additions), vault policy composition logic
+- **Subsystems touched:** dispatcher, Nomad jobspecs, Vault policies, `bin/disinto` wiring, deploy.sh
+- **Estimated sub-issues:** 6
+- **Ratio:** ~80% gluecode (wiring existing Nomad/Vault primitives), ~20% new logic (sidecar jobspec, policy composition, cutover gate)
+
+## Risks
+- **Silent secret drop:** Nomad template renders missing secrets as empty strings — a dispatched runner could silently operate without credentials it needs. The docker path fails loudly via `load_secret()`.
+- **Sidecar lifecycle mismatch:** Docker sidecars run as background processes tracked by PID. Nomad batch jobs have different lifecycle semantics (no PID, allocation-based tracking). The polling loops in reproduce/triage candidate selection must adapt.
+- **Policy composition permission gap:** Attaching per-dispatch policies is a new capability that the `service-dispatcher` role doesn't currently grant. If the dispatcher's Vault token lacks that permission, every Nomad dispatch fails.
+- **Mount mapping:** Docker bind-mounts (docker.sock, ssh keys, gpg) don't map 1:1 to Nomad host volumes. A missing mount breaks any formula that needs those credentials.
+- **Rollback gap:** If Nomad dispatch breaks in production, there is no automatic fallback to Docker unless one is explicitly coded.
+
+## Cost — new infra to maintain
+- **1 new jobspec:** sidecar batch job (reproduce/triage/verify) — parameterized, analogous to vault-runner.hcl
+- **Policy composition logic** in dispatcher — new code path that must stay in sync with `vault/policies/runner-*.hcl`
+- **No new services or scheduled tasks** — uses existing Nomad cluster, Vault instance, and dispatcher polling loop
+- **Host volume declarations** for ssh/gpg/sops on Nomad clients (if formulas require them)
+
+## Recommendation
+Worth it. This is the final step in the Nomad migration (S1-S4 complete). Most infrastructure exists: vault-runner.hcl, edge.hcl, per-secret policies, and the Nomad launcher function have already landed. The work is predominantly wiring and validation. Deferring leaves the dispatcher dependent on the Docker socket, which contradicts the Nomad migration's security and scheduling goals. The sidecar jobspec is the only greenfield piece; everything else is completing stubs or adding integration tests.
+
+## Sub-issues
+
+<!-- filer:begin -->
+- id: s5-nomad-policy-composition
+  title: "vision(#981): dispatcher composes per-dispatch Vault policies for nomad runner"
+  labels: [backlog]
+  depends_on: []
+  body: |
+    ## Goal
+    The dispatcher dynamically attaches only the Vault policies required by each action's `secrets = [...]` list when dispatching a `vault-runner` batch job, so Nomad-dispatched runners receive scoped secret access.
+
+    ## Acceptance criteria
+    - [ ] Dispatcher reads `secrets` field from action TOML and maps each to its `runner-<NAME>` Vault policy
+    - [ ] `service-dispatcher` Vault policy updated to allow token creation with runner-* policies
+    - [ ] Dispatched vault-runner job receives a Vault token scoped to only the requested secrets
+    - [ ] Dispatch with empty `secrets = []` succeeds (runner gets no secret policies)
+    - [ ] Dispatch with unknown secret name logs error and fails before launch
+
+- id: s5-nomad-sidecar-jobspec
+  title: "vision(#981): add parameterized batch jobspec for reproduce/triage/verify sidecars"
+  labels: [backlog]
+  depends_on: []
+  body: |
+    ## Goal
+    Create a Nomad parameterized batch jobspec for sidecar containers (reproduce, triage, verify) analogous to vault-runner.hcl, so `_dispatch_sidecar_nomad()` has a job to dispatch.
+
+    ## Acceptance criteria
+    - [ ] `nomad/jobs/sidecar.hcl` exists as a parameterized batch job accepting `issue_number`, `formula`, and `project_toml` meta
+    - [ ] Jobspec mounts required host volumes (project-repos, ops-repo, agent-data)
+    - [ ] Vault template renders bot token for sidecar identity (e.g. `kv/data/disinto/bots/dev`)
+    - [ ] `nomad job validate nomad/jobs/sidecar.hcl` passes
+    - [ ] `bin/disinto --with edge` deploys the sidecar job alongside other jobs
+
+- id: s5-implement-dispatch-sidecar-nomad
+  title: "vision(#981): implement _dispatch_sidecar_nomad() in dispatcher.sh"
+  labels: [backlog]
+  depends_on: [s5-nomad-sidecar-jobspec]
+  body: |
+    ## Goal
+    Replace the stub `_dispatch_sidecar_nomad()` with a working implementation that dispatches reproduce/triage/verify jobs via Nomad and tracks their allocation lifecycle.
+
+    ## Acceptance criteria
+    - [ ] `_dispatch_sidecar_nomad()` dispatches `sidecar` parameterized batch job with correct meta
+    - [ ] Dispatcher tracks sidecar allocations instead of PIDs for the Nomad backend
+    - [ ] Reproduce/triage/verify polling loops correctly detect sidecar completion via Nomad allocation state
+    - [ ] Sidecar logs are retrievable via `nomad alloc logs` and included in dispatcher logging
+
+- id: s5-nomad-runner-mounts
+  title: "vision(#981): support ssh/gpg/sops mount aliases in Nomad dispatch path"
+  labels: [backlog]
+  depends_on: []
+  body: |
+    ## Goal
+    Action TOMLs with `mounts = ["ssh", "gpg", "sops"]` work correctly when dispatched via the Nomad backend, matching the Docker backend's bind-mount behavior.
+
+    ## Acceptance criteria
+    - [ ] Nomad client config declares host volumes for ssh, gpg, and sops credential paths
+    - [ ] `vault-runner.hcl` declares optional volume mounts for each alias
+    - [ ] `_launch_runner_nomad()` passes mount requirements as dispatch meta
+    - [ ] Runner container receives mounted credentials at expected paths
+    - [ ] Dispatch without mounts succeeds (no volume mount errors)
+
+- id: s5-nomad-dispatch-smoke-test
+  title: "vision(#981): end-to-end smoke test for Nomad dispatch path"
+  labels: [backlog]
+  depends_on: [s5-nomad-policy-composition, s5-implement-dispatch-sidecar-nomad, s5-nomad-runner-mounts]
+  body: |
+    ## Goal
+    Validate that the full Nomad dispatch path works end-to-end: action TOML → dispatcher → `nomad job dispatch` → runner execution → `result.json` written back to the ops repo.
+
+    ## Acceptance criteria
+    - [ ] Test action TOML dispatched with `DISPATCHER_BACKEND=nomad` completes successfully
+    - [ ] Runner receives correct secrets via Vault template (non-empty for granted, empty for ungranted)
+    - [ ] Result JSON written to ops repo with correct exit code and log excerpt
+    - [ ] Sidecar dispatch (reproduce formula) launches and completes via Nomad
+    - [ ] Failure case: action with invalid formula produces exit code != 0 in result
+
+- id: s5-retire-docker-compose-dispatch
+  title: "vision(#981): retire docker-compose dispatch path and remove Docker socket dependency"
+  labels: [backlog]
+  depends_on: [s5-nomad-dispatch-smoke-test]
+  body: |
+    ## Goal
+    Remove the docker-compose dispatch backend from the dispatcher once the Nomad path is validated, eliminating the Docker socket mount from the edge container.
+
+    ## Acceptance criteria
+    - [ ] `_launch_runner_docker()` and `_dispatch_sidecar_docker()` removed from dispatcher.sh
+    - [ ] `DISPATCHER_BACKEND` env var removed; Nomad is the only path
+    - [ ] `edge.hcl` no longer mounts Docker socket
+    - [ ] `docker-compose.yml` edge service no longer mounts Docker socket (if compose is retained for dev)
+    - [ ] Dispatcher startup validates Nomad connectivity (`nomad status`) before entering the poll loop
+<!-- filer:end -->