Compare commits

24 commits

Author SHA1 Message Date
a3653e8d31 Merge pull request 'architect: edge-subpath-chat (#623)' (#37) from architect/edge-subpath-chat into main 2026-04-18 22:26:36 +00:00
15b37b6756 sprint: add edge-subpath-chat.md 2026-04-16 02:15:24 +00:00
bb37eaf588 Merge pull request 'architect: supervisor Docker storage telemetry' (#34) from architect/supervisor-docker-storage into main 2026-04-15 17:39:19 +00:00
590cf45139 Merge pull request 'architect: process evolution lifecycle' (#35) from architect/process-evolution-lifecycle into main 2026-04-15 17:39:12 +00:00
8fe5da6b57 Merge pull request 'architect: supervisor project-wide oversight' (#32) from architect/supervisor-project-wide-oversight into main 2026-04-15 17:39:09 +00:00
e3a4eb352d Merge pull request 'architect: gatekeeper agent — external signal verification' (#31) from architect/gatekeeper-agent into main 2026-04-15 17:39:03 +00:00
a66af9246e Merge pull request 'architect: bug-report pipeline — inbound classification + auto-close' (#24) from architect/bug-report-pipeline into main 2026-04-15 17:38:55 +00:00
40352a7d27 Merge pull request 'architect: agent management redesign' (#21) from architect/agent-management-redesign into main 2026-04-15 17:38:54 +00:00
e73cb59efb Merge pull request 'architect: example project — full lifecycle demo' (#20) from architect/example-project-lifecycle into main 2026-04-15 17:38:46 +00:00
fffb791637 Merge pull request 'architect: vault blast-radius tiers' (#12) from architect/vault-blast-radius-tiers into main 2026-04-15 17:38:35 +00:00
ed1e4882b2 Merge pull request 'architect: versioned agent images' (#11) from architect/versioned-agent-images into main 2026-04-15 17:38:33 +00:00
247e03024a Merge pull request 'architect: website observability wire-up' (#10) from architect/website-observability-wire-up into main 2026-04-15 17:38:11 +00:00
architect-bot
e11fd85c9a sprint: add process-evolution-lifecycle.md 2026-04-15 10:06:02 +00:00
51a33bb0f1 sprint: add supervisor-project-wide-oversight.md 2026-04-15 07:25:10 +00:00
1715b2480b sprint: add gatekeeper-agent.md 2026-04-15 03:02:57 +00:00
a3f2723626 sprint: add bug-report-pipeline.md 2026-04-12 04:09:56 +00:00
90789cbb5a sprint: add agent-management-redesign.md 2026-04-12 02:02:51 +00:00
17f350d7a3 sprint: add example-project-lifecycle.md 2026-04-12 01:00:26 +00:00
75f06bd313 sprint: add design forks for website-observability-wire-up 2026-04-12 00:58:08 +00:00
a5cbaae2b4 Merge pull request 'sprint(versioned-agent-images): add side-effects, follow-up sprints, updated recommendation' (#15) from architect/versioned-agent-images-followups into architect/versioned-agent-images
Reviewed-on: #15
Reviewed-by: disinto-admin <admin@disinto.local>
2026-04-11 10:20:07 +00:00
3a172bcc86 sprint(versioned-agent-images): add side-effects, four follow-up sprints, updated recommendation
Enriches the architect's existing sprint plan with:

1. Side effects: this sprint indirectly closes #665 (edge cold-start race) by
   removing the runtime clone — flagging so a parallel #665 fix isn't applied.

2. Four follow-up sprints that complete the client-box upgrade story:
   - A: 'disinto upgrade <version>' subcommand for atomic client-side upgrades
   - B: unify DISINTO_VERSION and AGENTS_IMAGE into one version concept
   - C: migration framework for breaking changes (per-version migration files)
   - D: bootstrap-from-broken-state runbook for existing drifted boxes (harb)

3. Updated recommendation that sequences the follow-ups against this sprint
   and notes #665 should not be fixed in parallel.

The original sprint scope (4 files, ~80% gluecode, GHCR) is unchanged and
remains tightly scoped. The follow-ups are deliberately kept inside this
document rather than filed as separate forge issues until the sprint plan is
ready to be broken into sub-issues by the architect.
2026-04-11 10:09:57 +00:00
174e2a63bf sprint: add vault-blast-radius-tiers.md 2026-04-09 08:33:51 +00:00
Architect Agent
eb7b403148 sprint: add versioned-agent-images.md 2026-04-09 08:31:46 +00:00
326ebb867a sprint: add website-observability-wire-up.md 2026-04-08 20:04:29 +00:00
10 changed files with 832 additions and 158 deletions

@@ -0,0 +1,52 @@
# Sprint: agent management redesign
## Vision issues
- #557 — redesign agent management — hire by inference backend, list by capability
## What this enables
After this sprint, operators can:
1. Hire agents by backend (disinto hire anthropic, disinto hire llama --url ...) instead of inventing names and roles
2. List all agents (disinto agents list) with backend, model, roles, and status in one table
3. Discover what is running without grepping compose files, TOML configs, and state directories
The factory becomes self-describing: an operator who inherits a running instance can immediately see what agents exist, what backends they use, and what roles they fill.
## What exists today
The agent management system is functional but fragmented:
- disinto hire-an-agent name role (lib/hire-agent.sh): Creates Forgejo user, .profile repo, API token, state file, and optionally writes agents TOML section plus regenerates compose. Works, but the mental model is backwards — operator must invent a name and pick a role before specifying the backend.
- disinto agent enable/disable/status (bin/disinto): Manages state files for 6 hardcoded core agents (dev, reviewer, gardener, architect, planner, predictor). Local-model agents are invisible to this command.
- agents TOML sections (projects/*.toml): Store local-model agent config (base_url, model, roles, forge_user). Read by lib/generators.sh to generate per-agent docker-compose services.
- AGENT_ROLES env var: Runtime gate in entrypoint.sh — comma-separated list of roles the container runs.
- Compose profiles: Local-model agents gated by profiles, requiring explicit --profile to start.
State lives in three disconnected places: state files (CLI), env vars (runtime), compose services (docker). No single command unifies them.
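A minimal sketch of what unifying those three sources could look like — the directory layout and TOML section names (`state/*.enabled`, `[agents.<name>]`) are illustrative assumptions, not the repo's actual schema:

```shell
#!/bin/sh
# Hypothetical unified agent discovery: derive the agent list from state
# files plus [agents.*] TOML sections instead of hardcoded arrays.
# Paths and the TOML layout here are assumptions for illustration.
list_agents() {
    state_dir="$1" toml_dir="$2"
    {
        # core agents: one state file per agent, e.g. state/dev.enabled
        for f in "$state_dir"/*.enabled; do
            [ -e "$f" ] && basename "$f" .enabled
        done
        # local-model agents: [agents.<name>] sections in project TOML
        grep -h '^\[agents\.' "$toml_dir"/*.toml 2>/dev/null |
            sed 's/^\[agents\.\([^]]*\)\]/\1/'
    } | sort -u
}
```

A `disinto agents list` built on this would enumerate both core and local-model agents without either hardcoded list.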
## Complexity
- Files touched: ~4 (bin/disinto, lib/hire-agent.sh, lib/generators.sh, docker/agents/entrypoint.sh)
- Subsystems: CLI, compose generator, container entrypoint, project TOML schema
- Estimated sub-issues: 4-5
- Gluecode vs greenfield: ~80% gluecode (refactoring existing hire-agent.sh and CLI), ~20% greenfield (new agents list output, backend-first hire UX)
## Risks
- Breaking existing hire-an-agent: The old command must keep working during transition. Operators may have scripts that call it. Deprecation path needed.
- State migration: Existing local-model agents configured via agents TOML need to work unchanged. The new system reads the same TOML — no migration required if we keep the schema.
- Entrypoint.sh hardcoded list: The 6 core agents are hardcoded in multiple places (entrypoint.sh, bin/disinto). Making this dynamic requires careful testing to avoid breaking the polling loop.
- TOML parsing fragility: The hire-agent.sh TOML writer uses a Python inline script. Changes to the TOML schema could break parsing if not tested.
## Cost — new infra to maintain
- No new services, cron jobs, or formulas. This is a refactor of existing CLI and configuration paths.
- New code: disinto hire subcommand (~100 lines), disinto agents list subcommand (~80 lines), agent registry logic that unifies the three state sources (~50 lines).
- Removed code: Portions of the current hire-an-agent that duplicate backend detection logic.
- Ongoing: The hardcoded agent list in bin/disinto and entrypoint.sh becomes a derived list (from state files + TOML + compose). Slightly more complex discovery logic, but eliminates the need to update hardcoded lists when new agent types are added.
## Recommendation
Worth it. This is a high-value, low-risk refactor that directly improves the adoption story. The current UX is the number one friction point for new operators — hire-an-agent requires knowing three things (name, role, backend) in the wrong order. The redesign makes the common case (disinto hire anthropic) a one-liner and gives operators visibility into what is running. No new infrastructure, no new dependencies, mostly gluecode over existing interfaces.
Defer only if the team wants to stabilize the current agent set first (all 4 open architect sprints are pending human review). Otherwise, this is independent work that does not conflict with any in-flight sprint.

@@ -0,0 +1,54 @@
# Sprint pitch: bug-report pipeline — inbound classification + auto-close
## Vision issues
- #388 — end-to-end bug-report management — inbound classification, reproduction routing, and auto-close loop
## What this enables
After this sprint, bug-reports flow through a **cheap classification gate** before reaching the expensive reproduce-agent. Inspection-class bugs (stack trace cited, cause obvious from code) go straight to dev-agent — saving the full Playwright/MCP environment spin-up. The auto-close loop fires reliably, and upstream Codeberg reporters get notified when their bug is fixed.
Today: every bug-report → reproduce-agent (expensive). After: only ambiguous bugs → reproduce-agent; obvious bugs → dev-agent directly.
## What exists today
The pipeline is 80% built:
| Component | Status | Location |
|-----------|--------|----------|
| Gardener bug-report detection + enrichment | Complete | `formulas/run-gardener.toml:79-134` |
| Reproduce-agent (Playwright MCP, exit gates) | Complete | `formulas/reproduce.toml`, `docker/reproduce/` |
| Triage-agent (6-step root cause) | Complete | `formulas/triage.toml` |
| Dev-poll label gating (skips `bug-report`) | Complete | `dev/dev-poll.sh` |
| Auto-close decomposed parents | Complete (not firing) | `formulas/run-gardener.toml:224-269` |
| Issue templates (bug.yaml, feature.yaml) | Complete | `.forgejo/ISSUE_TEMPLATE/` |
| Manifest action system | Complete | `gardener/pending-actions.json` |
Reusable infrastructure: formula-session.sh, agent-sdk.sh, issue-lifecycle.sh label helpers, parse-deps.sh dependency extraction, manifest-driven mutation pattern.
## Complexity
- **5-6 sub-issues** estimated
- **~8 files touched** across formulas, lib, and gardener
- **Mostly gluecode** — extending existing gardener formula, adding a classification step, wiring auto-close reliability, adding upstream notification
- **One new formula step** (inbound classifier in run-gardener.toml or a dedicated pre-check)
- **No new containers or services** — classification runs inside existing gardener session
## Risks
- **Classification accuracy** — the cheap pre-check might route ambiguous bugs to dev-agent, wasting dev cycles on bugs it can't fix without reproduction. Mitigation: conservative skip-reproduction criteria (all four pre-check questions must be clean).
- **Gardener formula complexity** — run-gardener.toml is already the most complex formula. Adding classification logic increases cognitive load. Mitigation: classification could be a separate formula step with clear entry/exit gates.
- **Upstream Codeberg notification** — requires Codeberg API token in `.env.vault.enc`. Currently in `.netrc` on host but not in containers. Needs vault action for the actual notification (AD-006 compliance).
- **Auto-close timing** — if gardener runs are infrequent (every 6h), auto-close feedback loop is slow. Not a sprint problem per se, but worth noting.
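The conservative gate described above can be sketched as a small function. The four pre-check questions are not enumerated in this document, so the two checks below are illustrative stand-ins; the shape worth keeping is that every check must pass before a bug bypasses the reproduce-agent:

```shell
#!/bin/sh
# Sketch of a conservative skip-reproduction gate. Any failed check
# falls through to the expensive reproduce-agent path; only a clean
# sweep routes the bug straight to dev-agent. The specific checks are
# hypothetical stand-ins for the sprint's pre-check questions.
classify_bug() {
    body="$1"
    printf '%s' "$body" | grep -qiE 'stack ?trace|traceback' \
        || { echo reproduce; return; }
    printf '%s' "$body" | grep -qi 'steps to reproduce' \
        || { echo reproduce; return; }
    echo dev
}
```

Biasing every ambiguous case toward `reproduce` is what keeps the false-positive risk (dev-agent wasting cycles) low.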
## Cost — new infra to maintain
- **One new gardener formula step** (inbound classification) — maintained alongside existing grooming step
- **Bug taxonomy labels** (bohrbug, heisenbug, mandelbug, schrodinbug or simplified equivalents) — 2-4 new labels
- **No new services, cron jobs, or agent roles** — everything runs within existing gardener cycle
- **Codeberg notification vault action template** — one new TOML in `vault/examples/`
## Recommendation
**Worth it.** The infrastructure is 80% built. This sprint fills the two concrete gaps (classification gate + auto-close reliability) with minimal new maintenance burden. The biggest value is avoiding unnecessary reproduce-agent runs — each one costs a full Claude session with Playwright MCP for bugs that could be triaged by reading code. The auto-close fix is nearly free (the logic exists, just needs the gardener to run reliably). Upstream notification is a small vault action addition.
Defer the statistical reproduction mode (Heisenbug handling) and bulk deduplication to a follow-up sprint — they add complexity without proportional value at current bug volume.

@@ -0,0 +1,106 @@
# Sprint: edge-subpath-chat
## Vision issues
- #623 — vision: subpath routing + Forgejo-OAuth-gated Claude chat inside the edge container
## What this enables
After this sprint, an operator running `disinto edge register` gets a single URL — `<project>.disinto.ai` — with Forgejo at `/forge/`, Woodpecker CI at `/ci/`, a staging preview at `/staging/`, and an OAuth-gated Claude Code chat at `/chat/`, all under one wildcard cert and one bootstrap password. The factory talks back to its operator through a chat window that sits next to the forge, CI, and live preview it is driving.
## What exists today
The majority of this vision is already implemented across issues #704-#711:
- **Subpath routing**: Caddyfile generator produces `/forge/*`, `/ci/*`, `/staging/*`, `/chat/*` handlers (`lib/generators.sh:780-822`). Forgejo `ROOT_URL` and Woodpecker `WOODPECKER_HOST` are set to subpath values when `EDGE_TUNNEL_FQDN` is present (`bin/disinto:842-847`).
- **Chat container**: Full OAuth flow via Forgejo, HttpOnly session cookies, forward_auth defense-in-depth with `FORWARD_AUTH_SECRET`, per-user rate limiting (hourly/daily/token caps), conversation history in NDJSON (`docker/chat/server.py`).
- **Sandbox hardening**: Read-only rootfs, `cap_drop: ALL`, `no-new-privileges`, `pids_limit: 128`, `mem_limit: 512m`, no Docker socket. Verification script at `tools/edge-control/verify-chat-sandbox.sh`.
- **Edge control plane**: Tunnel registration, port allocation, Caddy admin API routing, wildcard `*.disinto.ai` cert via DNS-01 (`tools/edge-control/`).
- **Dependencies #620/#621/#622**: Admin password prompt, edge control plane, and reverse tunnel — all implemented and merged.
- **Subdomain fallback plan**: Fully documented at `docs/edge-routing-fallback.md` with pivot criteria.
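For orientation, the routes the generator emits have roughly this shape. `handle`, `route`, `reverse_proxy`, and `forward_auth` are real Caddy directives; the upstream hostnames and ports are assumptions, and whether a given route strips the prefix (`handle_path`) or preserves it (`handle`) depends on each service's own subpath configuration:

```shell
#!/bin/sh
# Illustrative sketch of the subpath routes a Caddyfile generator could
# emit; not the repo's actual _generate_caddyfile_impl output.
emit_subpath_routes() {
    cat <<'EOF'
handle /forge/* {
    reverse_proxy forgejo:3000
}
handle /ci/* {
    reverse_proxy woodpecker:8000
}
route /chat/* {
    forward_auth chat:8080 {
        uri /auth/verify
    }
    reverse_proxy chat:8080
}
EOF
}
```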
## Complexity
- ~6 files touched across 3 subsystems (Caddy routing, chat backend, compose generation)
- Estimated 4 sub-issues
- ~90% gluecode (wiring existing pieces), ~10% greenfield (WebSocket streaming, end-to-end smoke test)
## Risks
- **Forgejo/Woodpecker subpath breakage**: Neither service is battle-tested under subpaths in this stack. Redirect loops, OAuth callback mismatches, or asset 404s are plausible. Mitigation: the fallback plan (`docs/edge-routing-fallback.md`) is already documented and estimated at under one day to pivot.
- **Cookie/CSRF collision**: Forgejo and chat share the same origin — cookie names or CSRF tokens could collide. Mitigation: chat uses a namespaced cookie (`disinto_chat_session`) and a separate OAuth app.
- **Streaming latency**: One-shot `claude --print` blocks until completion. Long responses leave the operator staring at a spinner. Not a correctness risk, but a UX risk that WebSocket streaming would fix.
## Cost — new infra to maintain
- **No new services** — all containers already exist in the compose stack
- **No new scheduled tasks or formulas** — chat is a passive request handler
- **One new smoke test** (CI) — end-to-end subpath routing verification
- **Ongoing**: monitoring Forgejo/Woodpecker upstream for subpath regressions on upgrades
## Recommendation
Worth it. The vision is ~80% implemented. The remaining work is integration hardening (confirming subpath routing works end-to-end with real Forgejo/Woodpecker) and one UX improvement (WebSocket streaming). The risk is low because a documented fallback to per-service subdomains exists. Ship this sprint to close the loop on the edge experience.
## Sub-issues
<!-- filer:begin -->
- id: subpath-routing-smoke-test
  title: "vision(#623): end-to-end subpath routing smoke test for Forgejo + Woodpecker + chat"
  labels: [backlog]
  depends_on: []
  body: |
    ## Goal
    Verify that Forgejo, Woodpecker, and chat all function correctly when served
    under /forge/, /ci/, and /chat/ subpaths on a single domain. Catch redirect
    loops, OAuth callback failures, and asset 404s before they hit production.
    ## Acceptance criteria
    - [ ] Forgejo login at /forge/ completes without redirect loops
    - [ ] Forgejo OAuth callback for Woodpecker succeeds under subpath
    - [ ] Woodpecker dashboard loads all assets at /ci/ (no 404s on JS/CSS)
    - [ ] Chat OAuth login flow works at /chat/login
    - [ ] Forward_auth on /chat/* rejects unauthenticated requests with 401
    - [ ] Staging content loads at /staging/
    - [ ] Root / redirects to /forge/
    - [ ] CI pipeline added to .woodpecker/ to run this test on edge-related changes
- id: websocket-streaming-chat
  title: "vision(#623): WebSocket streaming for chat UI to replace one-shot claude --print"
  labels: [backlog]
  depends_on: [subpath-routing-smoke-test]
  body: |
    ## Goal
    Replace the blocking one-shot claude --print invocation in the chat backend with
    a WebSocket connection that streams tokens to the UI as they arrive.
    ## Acceptance criteria
    - [ ] /chat/ws endpoint accepts WebSocket upgrade with valid session cookie
    - [ ] /chat/ws rejects upgrade if session cookie is missing or expired
    - [ ] Chat backend streams claude output over WebSocket as text frames
    - [ ] UI renders tokens incrementally as they arrive
    - [ ] Rate limiting still enforced on WebSocket messages
    - [ ] Caddy proxies WebSocket upgrade correctly through /chat/ws with forward_auth
- id: chat-working-dir-scoping
  title: "vision(#623): scope Claude chat working directory to project staging checkout"
  labels: [backlog]
  depends_on: [subpath-routing-smoke-test]
  body: |
    ## Goal
    Give the chat container Claude session read-write access to the project working
    tree so the operator can inspect, explain, or modify code — scoped to that tree
    only, with no access to factory internals, secrets, or Docker socket.
    ## Acceptance criteria
    - [ ] Chat container bind-mounts the project working tree as a named volume
    - [ ] Claude invocation in server.py sets cwd to the workspace directory
    - [ ] Claude permission mode is acceptEdits (not bypassPermissions)
    - [ ] verify-chat-sandbox.sh updated to assert workspace mount exists
    - [ ] Compose generator adds the workspace volume conditionally
- id: subpath-fallback-automation
  title: "vision(#623): automate subdomain fallback pivot if subpath routing fails"
  labels: [backlog]
  depends_on: [subpath-routing-smoke-test]
  body: |
    ## Goal
    If the smoke test reveals unfixable subpath issues, automate the pivot to
    per-service subdomains so the switch is a single config change.
    ## Acceptance criteria
    - [ ] generators.sh _generate_caddyfile_impl accepts EDGE_ROUTING_MODE env var
    - [ ] In subdomain mode, Caddyfile emits four host blocks per edge-routing-fallback.md
    - [ ] register.sh registers additional subdomain routes when EDGE_ROUTING_MODE=subdomain
    - [ ] OAuth redirect URIs in ci-setup.sh respect routing mode
    - [ ] .env template documents EDGE_ROUTING_MODE with a comment referencing the fallback doc
<!-- filer:end -->

@@ -0,0 +1,62 @@
# Sprint: example project — full lifecycle demo
## Vision issues
- #697 — vision: example project demonstrating the full disinto lifecycle
## What this enables
After this sprint, a new user can see disinto working end-to-end on a real project:
`disinto init` → seed issues appear → dev-agent picks one up → PR opens → CI runs →
review-agent approves → merge → repeat. The example repo serves as both proof-of-concept
and onboarding reference.
This unblocks:
- **Adoption — Example project demonstrating full lifecycle** (directly)
- **Adoption — Landing page** (indirectly — the example is the showcase artifact)
- **Contributors** (lower barrier — people can see how disinto works before trying it)
## What exists today
- `disinto init <url>` fully bootstraps a project: creates repos, ops repo, branch protection,
issue templates, VISION.md template, docker-compose stack, cron scheduling
- Dev-agent pipeline is proven: issue → branch → implement → PR → CI → review → merge
- Review-agent, gardener, supervisor all operational
- Project TOML templates exist (`projects/*.toml.example`)
- Issue template for bug reports exists; `disinto init` copies it to target repos
What's missing: an actual example project repo with seed content and seed issues that
demonstrate the loop.
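One way the seed-issue step could be scripted — `POST /api/v1/repos/{owner}/{repo}/issues` is the real Gitea/Forgejo API route, but `FORGE_URL`, `FORGE_TOKEN`, and the helper names are assumptions:

```shell
#!/bin/sh
# Hypothetical helper for filing seed issues after `disinto init`.
seed_issue_payload() {
    # minimal JSON body; assumes title/body contain no raw double quotes
    printf '{"title":"%s","body":"%s"}' "$1" "$2"
}

seed_issue() {
    repo="$1" title="$2" body="$3"
    curl -sf -X POST "$FORGE_URL/api/v1/repos/$repo/issues" \
        -H "Authorization: token $FORGE_TOKEN" \
        -H 'Content-Type: application/json' \
        -d "$(seed_issue_payload "$title" "$body")"
}
```

Re-running such a script on each fresh `disinto init` would address the seed-issue re-creation cost noted below.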
## Complexity
Files touched: 3-5 in the disinto repo (documentation, possibly `disinto init` tweaks)
New artifacts: 1 example project repo with seed files, 3-5 seed issues
Subsystems: bootstrap, dev-agent, CI, review
Sub-issues: 3-4
Gluecode ratio: ~70% content/documentation, ~30% scripting
## Risks
- **Maintenance burden**: The example project must stay working as disinto evolves.
If `disinto init` changes, the example may break. Mitigation: keep the example
minimal so there's less surface to break.
- **CI environment**: The example needs a working Woodpecker pipeline. If the
example uses a language that needs a specific Docker image in CI, that's a dependency.
Mitigation: choose a language/stack with zero build dependencies.
- **Seed issue quality**: If seed issues are too vague, the dev-agent will refuse them
(`underspecified`). If too trivial, the demo doesn't impress. Need a sweet spot.
- **Scope creep**: "Full lifecycle" could mean bootstrap-to-deploy. For this sprint,
scope to bootstrap-to-merge. Deploy profiles are a separate milestone.
## Cost — new infra to maintain
- One example project repo (hosted on Forgejo, mirrored to Codeberg/GitHub)
- Seed issues re-created on each fresh `disinto init` run (or documented as manual step)
- No new services, agents, or cron jobs
## Recommendation
Worth it. This is the single most impactful adoption artifact. A working example
answers "does this actually work?" in a way that documentation cannot. The example
should be dead simple — a static site or a shell-only project — so it works on any
machine without language-specific dependencies.

@@ -0,0 +1,58 @@
# Sprint: gatekeeper agent
## Vision issues
- #485 — gatekeeper agent: verify external signals before they enter the factory
## What this enables
The factory gains a trust boundary between external mirrors (Codeberg, GitHub) and the internal Forgejo issue tracker. Today, external bug reports and feature requests must be manually copied into internal Forgejo — there is no automated inbound path from mirrors. The gatekeeper closes this gap by:
1. Polling external mirrors for new issues
2. Verifying claims against internal ground truth (git history, agent logs, system state)
3. Creating sanitized internal issues only when claims are confirmed
4. Defending against prompt injection via issue body rewriting (evidence-based, never verbatim copy)
This directly supports the Growth goals: "attract developers", "contributors", "lower the barrier to entry" — external contributors can file issues on public mirrors and the factory processes them safely.
## What exists today
- **Mirror push** (lib/mirrors.sh): outbound only — pushes primary branch + tags to configured mirrors after each merge. No inbound path.
- **Bug-report pipeline** (docker/edge/dispatcher.sh): reproduce -> triage -> verify agents handle internal bug reports. The gatekeeper slots upstream of this pipeline.
- **Secret scanning** (lib/secret-scan.sh): detects and redacts secrets in issue bodies before posting. Reusable for gatekeeper sanitization.
- **Issue lifecycle** (lib/issue-lifecycle.sh): claim model, dependency checking, label filtering. Gatekeeper follows the same patterns.
- **Per-agent identities** (lib/hire-agent.sh): Forgejo bot accounts with dedicated tokens. Gatekeeper gets its own identity.
- **Agent run pattern**: all slow agents follow the same *-run.sh + formula pattern (gardener-run.sh, planner-run.sh, etc.). Gatekeeper follows this exactly.
- **Vault dispatch** for external actions: PR-based approval workflow. External API tokens (GITHUB_TOKEN, CODEBERG_TOKEN) are vault-only per AD-006.
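The "rewrite, never copy" defense can be sketched as follows. The field names are illustrative assumptions; the invariant is that no external text reaches the internal issue body verbatim:

```shell
#!/bin/sh
# Sketch: the internal issue body is composed only from fields the
# gatekeeper itself verified, never from the external report text.
compose_internal_issue() {
    source_url="$1" verified_claim="$2" evidence="$3"
    cat <<EOF
## Inbound report (gatekeeper-authored)
Source: $source_url
Verified claim: $verified_claim
Evidence: $evidence

The original report text was intentionally not copied; follow the source link.
EOF
}
```

Because every sentence is authored by the gatekeeper from its own evidence, a prompt-injection payload in the external body never enters the factory's context.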
## Complexity
- **New files (~5):** gatekeeper/gatekeeper-run.sh, gatekeeper/AGENTS.md, formulas/run-gatekeeper.toml, state marker
- **Modified files (~3):** docker/agents/entrypoint.sh (polling registration), AGENTS.md (agent table), docker-compose env vars
- **Gluecode vs greenfield:** ~70% gluecode (agent scaffolding, polling, issue creation reuse existing patterns), ~30% greenfield (external API polling, claim verification logic, prompt injection defense)
- **Estimated sub-issues:** 5-7
## Risks
1. **Prompt injection via crafted issue bodies** — primary risk. External issue bodies are untrusted input that could manipulate agents if copied verbatim. Mitigation: gatekeeper rewrites issues based on verified evidence, never copies external content. The internal issue body is authored by the gatekeeper, not the reporter.
2. **AD-006 tension** — the gatekeeper needs READ access to external APIs (GitHub Issues API, Codeberg Issues API), which requires GITHUB_TOKEN and CODEBERG_TOKEN. These are vault-only tokens per AD-006. The token access pattern is a design fork (see below).
3. **False positives/negatives** — gatekeeper may reject legitimate reports (false positive) or admit crafted ones (false negative). Mitigated by the existing reproduce/triage pipeline downstream — the gatekeeper is a first filter, not the only filter.
4. **Rate limiting** — GitHub API has 5000 req/hour (authenticated). Codeberg (Gitea) is typically more permissive. Hourly polling with pagination is well within limits.
5. **Scope creep** — the gatekeeper could expand into a full external community management tool. Sprint should be scoped to: poll -> verify -> create internal issue. No auto-responses to external reporters in v1.
## Cost — new infra to maintain
- **1 new agent** in the polling loop (gatekeeper-run.sh, ~6h interval like gardener/architect)
- **1 new Forgejo bot account** (gatekeeper-bot)
- **1 new formula** (run-gatekeeper.toml)
- **External API dependency** — relies on GitHub/Codeberg APIs being available. Failure is graceful (skip poll, retry next interval).
- **No new services, no new containers** — runs in the existing agents container.
## Recommendation
**Worth it.** This is the missing inbound path for the mirror infrastructure that already exists. The outbound side (mirror push) has been working since early in the project. The gatekeeper completes the loop: code goes out via mirrors, feedback comes back via gatekeeper. The implementation is mostly gluecode following established patterns, with the greenfield work concentrated in the verification logic and prompt injection defense — both well-scoped problems.
The main decision point is how to handle AD-006 (external token access). This is a design fork that needs human input before sub-issues can be filed.

@@ -0,0 +1,62 @@
# Sprint: process evolution lifecycle
## Vision issues
- #418 — Process evolution: observe-propose-shadow-promote lifecycle
## What this enables
After this sprint, the factory can **safely mutate its own processes**. Today, agents observe project state (stuck issues, stale evidence) but not process state (how long do reviews take? how often do dev sessions fail on the same pattern? which formula steps are bottlenecks?). Process changes are manual edits to formulas with no testing path.
After this sprint:
- Agents collect structured process metrics (review latency, failure rates, escalation frequency)
- The predictor can propose process changes as structured RFCs with evidence
- Process changes can shadow-run alongside the current process before promotion
- Humans gate the shadow-to-promote transition via the existing vault/PR approval pattern
The factory becomes self-improving in a controlled, reversible way.
## What exists today
Strong foundations — most of the lifecycle has analogues already:
| Stage | Existing analogue | Gap |
|-------|------------------|-----|
| **Observe** | Predictor scans project state; knowledge graph does structural analysis | No process-state metrics (latency, failure rates, throughput) |
| **Propose** | Prediction-triage workflow (predictor files, planner triages) | Predictions are about project state, not process changes; no RFC format |
| **Shadow** | Nothing | No infrastructure to run two processes in parallel and compare |
| **Promote** | Vault PR approval; sprint ACCEPT/REJECT | Not wired to process lifecycle |
Additional existing infrastructure:
- `.profile/lessons-learned.md` captures per-agent learning (abstract patterns)
- `ops/knowledge/planner-memory.md` persists planner observations across runs
- `docs/EVIDENCE-ARCHITECTURE.md` defines sense vs mutation processes
- Formulas (`formulas/*.toml`) define processes but have no versioning
## Complexity
- **Files touched**: ~6-8 (knowledge graph, predictor formula, planner formula, new process-metrics formula, evidence architecture docs, ops repo RFC directory)
- **Subsystems**: predictor, planner, knowledge graph, formula-session, evidence pipeline
- **Estimated sub-issues**: 6-8
- **Gluecode vs greenfield**: ~60% gluecode (extending prediction-triage, adding graph nodes, wiring evidence collection) / ~40% greenfield (process metrics collector, RFC format, shadow-run comparator)
## Risks
1. **Compute cost**: Shadow-running doubles resource usage during shadow periods. Needs a time-bound or cycle-bound cap.
2. **Wrong metrics**: Process metrics must be carefully chosen — optimizing for speed could sacrifice quality. The predictor's existing "evidence strength" checks provide a model.
3. **Scope creep**: "Process evolution" could expand endlessly. This sprint should deliver the pipeline (observe, propose, shadow, promote) for ONE process as proof-of-concept, not all processes at once.
4. **Over-engineering risk**: The factory has ~10 agents, not 1000 microservices. The mechanism should be proportional to the system's complexity. A lightweight RFC-in-ops-repo approach is better than a framework.
## Cost — new infra to maintain
- **Process metrics formula** (`formulas/collect-process-metrics.toml`): new formula, runs on predictor/planner schedule. Collects from git log, CI API, and issue timeline.
- **RFC directory** (`ops/process-rfcs/`): new directory in ops repo. Low maintenance — just markdown files.
- **Shadow-run comparator**: new step in formula-session.sh that can fork a formula step between current and candidate implementations. Needs cleanup logic for shadow artifacts.
- **No new services or containers** — this extends existing agent capabilities, doesn't add new ones.
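An example of the kind of number the metrics collector would emit — the input format ("opened merged" epoch-second pairs, presumably extracted from the forge API or git log) is an assumption:

```shell
#!/bin/sh
# Illustrative process metric: median review latency across PRs.
review_latency_p50() {
    # stdin: "<opened_epoch> <merged_epoch>" per PR; prints the median
    # latency in seconds (lower middle value for even counts)
    awk '{print $2 - $1}' | sort -n |
        awk '{a[NR] = $1} END {print a[int((NR + 1) / 2)]}'
}
```

A shadow run would compare this same statistic for the current and candidate process over the shadow window.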
## Recommendation
**Worth it — but scope tightly to one proof-of-concept process.**
The prediction-triage workflow already implements observe-to-propose. Extending it to include shadow-to-promote is a natural evolution, not a leap. The key risk is scope creep — this sprint should deliver the pipeline for ONE process mutation (e.g., "skip review for docs-only PRs" or "auto-close stale predictions after 7 days") and prove the lifecycle works end-to-end.
Defer: building a generic process evolution framework. The first sprint proves the pattern; generalization comes later if the pattern holds.

@@ -0,0 +1,42 @@
# Sprint: supervisor project-wide oversight
## Vision issues
- #540 — supervisor should have project-wide oversight, not just self-monitoring
## What this enables
After this sprint, the supervisor can:
1. Discover all Docker Compose stacks on the deployment box — not just the disinto factory
2. Attribute resource pressure to specific stacks — "harb-anvil-1 grew 12 GB" instead of "disk at 98%"
3. Surface cross-stack symptoms (restarting containers, unhealthy services, volume bloat) without per-project knowledge
4. Coordinate remediation through vault items naming the stack owner, rather than blindly pruning
This turns the supervisor from a single-project health monitor into a deployment-box health monitor — critical because factory deployments coexist with the projects they supervise.
## What exists today
- preflight.sh (227 lines) — already collects RAM, disk, load, docker ps, CI, PRs, issues, locks, phase files, worktrees, vault items. Easy to extend.
- run-supervisor.toml — priority framework (P0-P4) with auto-fix vs. vault-item escalation. New cross-stack rules slot into existing tiers.
- Edge container — has docker socket access, docker CLI installed. Can run docker compose ls, docker stats, docker system df.
- projects/*.toml — per-project config with [services].containers field. Could be extended for sibling stack ownership.
- AD-006 — external actions go through vault. Supervisor reports foreign stack symptoms but does not auto-remediate.
- docker system prune -f — already runs as P1 auto-fix. Currently affects all images symmetrically (the problem this sprint solves).
## Complexity
- Files touched: 3-4 (preflight.sh, run-supervisor.toml, projects/*.toml schema, new knowledge/sibling-stacks.md)
- Subsystems: supervisor only — no changes to other agents
- Estimated sub-issues: 5-6
- Gluecode vs greenfield: 80/20 (extending existing preflight sections and priority rules vs. stack ownership model)
## Risks
1. Docker socket blast radius — mitigated by read-only discovery commands; write actions stay vault-gated for foreign stacks.
2. docker system prune collateral — scoping prune to disinto-managed images requires label-based filtering (com.disinto.managed=true), factory images need labeling first.
3. Performance of docker stats — mitigated by --no-stream --format for a single snapshot.
4. Stack ownership ambiguity — no standard way to identify who owns a foreign compose project. Design fork needed.
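A sketch of the read-only discovery commands and the label-scoped prune, assuming the `com.disinto.managed` label convention proposed above; the function names are illustrative, not existing preflight.sh code:

```shell
# Hypothetical preflight.sh extension: read-only cross-stack discovery plus a
# prune scoped to factory-labeled images, so sibling stacks are untouched.
snapshot_sibling_stacks() {
  docker compose ls --format json                                # every compose project on the box
  docker stats --no-stream --format '{{.Name}}\t{{.MemUsage}}'   # one-shot snapshot, no streaming
  docker system df -v                                            # attribute disk per image/volume
}

prune_factory_images_only() {
  # Replaces the blanket `docker system prune -f`; requires factory images to
  # carry the com.disinto.managed=true label first.
  docker image prune -f --filter "label=com.disinto.managed=true"
}
```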
## Cost — new infra to maintain
- No new services, cron jobs, or containers. Extends the existing supervisor.
- New knowledge file: knowledge/sibling-stacks.md (low maintenance).
- Optional TOML schema extension: [siblings] section in project config.
- Image labeling convention: com.disinto.managed=true on factory Dockerfiles and compose.
## Recommendation
Worth it. Addresses a real incident (harb-dev-box 98% disk), mostly gluecode extending proven patterns, adds no new services, directly supports Foundation milestone. The one-box-many-stacks model is the common case for resource-constrained dev environments.

---
# Sprint: vault blast-radius tiers
## Vision issues
- #419 — Vault: blast-radius based approval tiers
## What this enables
After this sprint, low-tier vault actions execute without waiting for a human. The dispatcher
auto-approves and merges vault PRs classified as `low` in `policy.toml`. Medium and high tiers
are unchanged: medium notifies and allows async review; high blocks until admin approves.
This removes the bottleneck on low-risk bookkeeping operations while preserving the hard gate
on production deploys, secret operations, and agent self-modification.
## What exists today
The tier infrastructure is fully built. Only the enforcement is missing.
- `vault/policy.toml` — Maps every formula to low/medium/high. Current low tier: groom-backlog,
triage, reproduce, review-pr. Medium: dev, run-planner, run-gardener, run-predictor,
run-supervisor, run-architect, upgrade-dependency. High: run-publish-site, run-rent-a-human,
add-rpc-method, release.
- `vault/classify.sh` — Shell classifier called by `vault-env.sh`. Returns the tier for a given formula.
- `vault/SCHEMA.md` — Documents the `blast_radius` override field (string: "low"/"medium"/"high")
that vault action TOMLs can use to override policy defaults.
- `vault/validate.sh` — Validates vault action TOML fields, including blast_radius.
- `docker/edge/dispatcher.sh` — Edge dispatcher. Polls the ops repo for merged vault PRs and
executes them. Currently fires ALL merged vault PRs without tier differentiation.
What's missing: the dispatcher does not read blast_radius, does not auto-approve low-tier PRs,
and does not differentiate notification behavior for medium vs. high tier.
## Complexity
Files touched: 3
- `docker/edge/dispatcher.sh` — read blast_radius from the vault action TOML; for low tier, call the
Forgejo API to approve and merge the PR directly (admin token); for medium, post a "pending async
review" comment; for high, leave pending (existing behavior)
- `lib/vault.sh` `vault_request()` — include blast_radius in the PR body so the dispatcher
can read it without re-parsing the TOML
- `docs/VAULT.md` — document the three-tier behavior for operators
Sub-issues: 3
Gluecode ratio: ~70% gluecode (dispatcher reads existing classify.sh output), ~30% new (auto-approve API call, comment logic)
## Risks
- Admin token for auto-approve: the dispatcher needs an admin-level Forgejo token to approve
and merge PRs. Currently `FORGE_TOKEN` is used; branch protection has `admin_enforced: true`,
which means even admin bots are blocked from bypassing the approval gate. This is the core
design fork: either (a) relax admin_enforced for low-tier PRs, or (b) use a separate
Forgejo "auto-approver" account with admin rights, or (c) bypass the PR workflow entirely
for low-tier actions (execute directly without a PR).
- Policy drift: as new formulas are added, policy.toml must be updated. If a formula is missing,
classify.sh should default to "high" (fail safe). Currently the default behavior is unknown —
this needs to be hardened.
- Audit trail: low-tier auto-approvals should still leave a record. An auto-approve comment
("auto-approved: low blast radius") satisfies this.
## Cost — new infra to maintain
- One new Forgejo account or token (if the auto-approver route is chosen) — needs a rotation policy
- `policy.toml` maintenance: every new formula must be classified before shipping
- No new services, cron jobs, or containers
## Recommendation
Worth it, but the design fork on the auto-approve mechanism must be resolved before implementation
begins — this is the questions step.
The cleanest approach is option (c): bypass the PR workflow for low-tier actions entirely.
The dispatcher detects blast_radius=low, executes the formula immediately without creating
a PR, and writes to `vault/fired/` directly. This avoids the admin token problem, preserves
the PR workflow for medium/high, and keeps the audit trail in git. However, it changes the
blast_radius=low behavior from "PR exists but auto-merges" to "no PR, just executes" — operators
need to understand the difference.
The PR route (option b) is more visible but requires a dedicated account.
---
## Design forks
Three decisions must be made before implementation begins.
### Fork 1 (Critical): Auto-approve merge mechanism
Branch protection on the ops repo requires `required_approvals: 1` and `admin_enforced: true`.
For low-tier vault PRs, the dispatcher must merge without a human approval.
**A. Skip PR entirely for low-tier**
vault-bot commits directly to `vault/actions/` on main using admin token. No PR created.
Dispatcher detects new TOML file by absence of `.result.json`.
- Simplest dispatcher code
- No PR audit trail for low-tier executions
- `FORGE_ADMIN_TOKEN` already exists in vault env (used by `is_user_admin()`)
**B. Dispatcher self-approves low-tier PRs**
vault-bot creates PR as today, then immediately posts an APPROVED review using its own token,
then merges. vault-bot needs Forgejo admin role so `admin_enforced: true` does not block it.
- Full PR audit trail for all tiers
- Requires granting vault-bot admin role on Forgejo
**C. Tier-aware branch protection**
Create a separate Forgejo protection rule for `vault/*` branch pattern with `required_approvals: 0`.
Main branch protection stays unchanged. vault-bot merges low-tier PRs directly.
- No new accounts or elevated role for vault-bot
- Protection rules are in Forgejo admin UI, not code (harder to version)
- Forgejo `vault/*` glob support needs verification
**D. Dedicated auto-approve bot**
Create a `vault-auto-bot` Forgejo account with admin role that auto-approves low-tier PRs.
Cleanest trust separation; most operational overhead.
---
### Fork 2 (Secondary): Policy storage format
Where does the formula → tier mapping live?
**A. `vault/policy.toml` in disinto repo**
Flat TOML: `formula = "tier"`. classify.sh reads it at runtime.
Unknown formulas default to `high`. Changing policy requires a disinto PR.
**B. `blast_radius` field in each `formulas/*.toml`**
Add `blast_radius = "low"|"medium"|"high"` to each formula TOML.
classify.sh reads the target formula TOML for its tier.
Co-located with formula — impossible to add a formula without declaring its risk.
**C. `vault/policy.toml` in ops repo**
Same format as A but lives in the ops repo. Operators update without a disinto PR.
Useful for per-deployment overrides.
**D. Hybrid: formula TOML default + ops override**
Formula TOML carries a default tier. Ops `vault/policy.toml` can override per-deployment.
Most flexible; classify.sh must merge two sources.
---
### Fork 3 (Secondary): Medium-tier dev-loop behavior
When dev-agent creates a vault PR for a medium-tier action, what does it do while waiting?
**A. Non-blocking: fire and continue immediately**
Agent creates vault PR and moves to next issue without waiting.
Maximum autonomy; sequencing becomes unpredictable.
**B. Soft-block with 2-hour timeout**
Agent waits up to 2 hours polling for vault PR merge. If no response, continues.
Balances oversight with velocity.
**C. Status-quo block (medium = high)**
Medium-tier blocks the agent loop like high-tier today. Only low-tier actions unblocked.
Simplest behavior change — no modification to dev-agent flow needed.
**D. Label-based approval signal**
Agent polls for a `vault-approved` label on the vault PR instead of waiting for merge.
Decouples "approved to continue" from "PR merged and executed."
---
## Proposed sub-issues
### Core (always filed regardless of fork choices)
**Sub-issue 1: vault/classify.sh — classification engine**
Implement `vault/classify.sh`: reads formula name, secrets, optional `blast_radius` override,
applies policy rules, outputs tier (`low|medium|high`). Default-deny: unknown → `high`.
Files: `vault/classify.sh` (new), `vault/vault-env.sh` (call classify)
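A hedged sketch of the default-deny lookup, assuming the flat Fork 2A policy format (`formula = "tier"` lines); the function name and file paths are illustrative:

```shell
# Minimal classify sketch: look a formula up in policy.toml, fall back to
# "high" when the formula is unknown (default-deny, as sub-issue 1 requires).
classify() {
  local formula="$1" policy="${2:-vault/policy.toml}" tier
  tier=$(sed -nE "s/^${formula}[[:space:]]*=[[:space:]]*\"(low|medium|high)\".*/\1/p" "$policy" | head -n1)
  echo "${tier:-high}"
}
```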
**Sub-issue 2: docs/BLAST-RADIUS.md and SCHEMA.md update**
Write `docs/BLAST-RADIUS.md`. Add optional `blast_radius` field to `vault/SCHEMA.md`
and validator.
Files: `docs/BLAST-RADIUS.md` (new), `vault/SCHEMA.md`, `vault/vault-env.sh`
**Sub-issue 3: Update prerequisites.md**
Mark vault redesign (#73-#77) as DONE (stale). Add blast-radius tiers to the tree.
Files: `disinto-ops/prerequisites.md`
### Fork 1 variants (pick one)
**1A** — Modify `lib/vault.sh` to skip PR for low-tier, commit directly to main.
Modify `dispatcher.sh` to skip `verify_admin_merged()` for low-tier TOMLs.
**1B** — Modify `dispatcher.sh` to post APPROVED review + merge for low-tier.
Grant vault-bot admin role in Forgejo setup scripts.
**1C** — Add `setup_vault_branch_protection_tiered()` to `lib/branch-protection.sh`
with `required_approvals: 0` for `vault/*` pattern (verify Forgejo glob support first).
**1D** — Add `vault-auto-bot` account to `forge-setup.sh`. Implement approval watcher.
### Fork 2 variants (pick one)
**2A** — Create `vault/policy.toml` in disinto repo. classify.sh reads it.
**2B** — Add `blast_radius` field to all 15 `formulas/*.toml`. classify.sh reads formula TOML.
**2C** — Create `disinto-ops/vault/policy.toml`. classify.sh reads ops copy at runtime.
**2D** — Two-pass classify.sh: formula TOML default, ops policy override.
### Fork 3 variants (pick one)
**3A** — Non-blocking: `lib/vault.sh` returns immediately after PR creation for all tiers.
**3B** — Soft-block: poll medium-tier PR every 15 min for up to 2 hours.
**3C** — No change: medium-tier behavior unchanged (only low-tier unblocked).
**3D** — Create `vault-approved` label. Modify `lib/vault.sh` medium path to poll label.

---
# Sprint: versioned agent images
## Vision issues
- #429 — feat: publish versioned agent images — compose should use image: not build:
## What this enables
After this sprint, `disinto init` produces a `docker-compose.yml` that pulls a pinned image
from a registry instead of building from source. A new factory instance needs only a token
and a config file — no clone, no build, no local Docker context. This closes the gap between
"works on my machine" and "one-command bootstrap."
It also enables rollback: if agents misbehave after an upgrade, `AGENTS_IMAGE=v0.1.1 disinto up`
restores the previous version without touching the codebase.
## What exists today
The release pipeline is more complete than it looks:
- `formulas/release.toml` — 7-step release formula. Steps 4-5 already build and tag the image
locally (`docker compose build --no-cache agents`, `docker tag disinto-agents disinto-agents:$RELEASE_VERSION`).
The gap: no push step, no registry target.
- `lib/release.sh` — Creates vault TOML and ops repo PR for the release. No image version wired
into compose generation.
- `lib/generators.sh` `_generate_compose_impl()` — Generates compose with `build: context: .
dockerfile: docker/agents/Dockerfile` for agents, runner, reproduce, edge. Version-unaware.
- `vault/vault-env.sh` — `DOCKER_HUB_TOKEN` is in `VAULT_ALLOWED_SECRETS` but not currently used.
- `docker/agents/Dockerfile` — No VOLUME declarations; runtime state, repos, and config are
mounted via compose but not declared. Claude binary injected by compose at init time.
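A sketch of the missing push step, assuming a GHCR target and an illustrative image name; `RELEASE_VERSION` is the variable the release formula already exports for the tag step:

```shell
# Hypothetical push-image step for formulas/release.toml. The registry path
# (ghcr.io/disinto/disinto-agents) is an assumption, not an existing target.
push_image() {
  local version="$1" image="ghcr.io/disinto/disinto-agents"
  echo "$GITHUB_TOKEN" | docker login ghcr.io -u disinto --password-stdin
  docker tag "disinto-agents:${version}" "${image}:${version}"
  docker push "${image}:${version}"
}
```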
## Complexity
Files touched: 4
- `formulas/release.toml` — add `push-image` step (after tag-image, before restart-agents)
- `lib/generators.sh` — `_generate_compose_impl()` reads `AGENTS_IMAGE` env var; emits
`image:` when set, falls back to `build:` when not set (dev mode)
- `docker/agents/Dockerfile` — add explicit VOLUME declarations for /home/agent/data,
/home/agent/repos, /home/agent/disinto/projects, /home/agent/disinto/state
- `bin/disinto` `disinto_up()` — pass `AGENTS_IMAGE` through to compose if set in `.env`
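The image-vs-build switch can be sketched as below; the helper name and registry path are assumptions, and the YAML fragment is only the service-level portion the generator would emit:

```shell
# Hypothetical fragment of _generate_compose_impl: emit `image:` when
# AGENTS_IMAGE is set, fall back to `build:` for dev mode.
emit_agents_service() {
  if [ -n "${AGENTS_IMAGE:-}" ]; then
    printf '    image: ghcr.io/disinto/disinto-agents:%s\n' "$AGENTS_IMAGE"
  else
    printf '    build:\n      context: .\n      dockerfile: docker/agents/Dockerfile\n'
  fi
}
```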
Subsystems: release formula, compose generation, Dockerfile hygiene
Sub-issues: 3
Gluecode ratio: ~80% gluecode (release step, VOLUME declarations), ~20% new (AGENTS_IMAGE env var path)
## Risks
- Registry credentials: `DOCKER_HUB_TOKEN` is in vault allowlist but not wired up. The push step
needs a registry login — either Docker Hub (DOCKER_HUB_TOKEN) or GHCR (GITHUB_TOKEN, already
in vault). The sprint spec must pick one and add the credential to the release vault TOML.
- Volume shadow: if VOLUME declarations don't match the compose volume mounts exactly, runtime
files land in anonymous volumes instead of named ones. Must test before shipping.
- Existing deployments: currently on `build:`. Migration: set AGENTS_IMAGE in .env, re-run
`disinto init` (compose is regenerated), restart. No SSH, no worktree needed.
- `runner` service: same image as agents, same version. Must update runner service in compose gen too.
## Cost — new infra to maintain
- Registry account + token rotation: one vault secret (DOCKER_HUB_TOKEN) needs rotation policy.
GHCR (via GITHUB_TOKEN) has no additional account but ties release to GitHub.
- Release formula grows from 7 to 8 steps. Small maintenance surface.
- `AGENTS_IMAGE` becomes a documented env var in .env for pinned deployments. Needs docs.
## Recommendation
Worth it. The release formula is 90% done — one push step closes the gap. The compose
generation change is purely additive (AGENTS_IMAGE env var, fallback to build: for dev).
Volume declarations are hygiene that should exist regardless of versioning.
Pick GHCR over Docker Hub: GITHUB_TOKEN is already in the vault allowlist and ops repo.
No new account needed.
## Side effects of this sprint
Beyond versioned images, this sprint indirectly closes one open bug:
- **#665 (edge cold-start race)** — `disinto-edge` currently exits with code 128 on a cold
`disinto up` because its entrypoint clones from `forgejo:3000` before forgejo's HTTP
listener is up. Once edge's image embeds the disinto source at build time (no runtime
clone), the race vanishes. The `depends_on: { forgejo: { condition: service_healthy } }`
workaround proposed in #665 becomes unnecessary.
Worth flagging explicitly so a dev bot working on #665 doesn't apply that workaround in
parallel — it would be churn this sprint deletes anyway.
## What this sprint does not yet enable
This sprint delivers versioned images and pinned compose. It is a foundation, not the
whole client-box upgrade story. Four follow-up sprints complete the picture for harb-style
client boxes — each independently scopable, with the dependency chain noted.
### Follow-up A: `disinto upgrade <version>` subcommand
**Why**: even with versioned images, an operator on a client box still has to coordinate
multiple steps to upgrade — `git fetch && git checkout`, edit `.env` to set
`AGENTS_IMAGE`, re-run `_generate_compose_impl`, `docker compose pull`,
`docker compose up -d --force-recreate`, plus any out-of-band migrations. There is no
single atomic command. Without one, "upgrade harb to v0.3.0" stays a multi-step human
operation that drifts out of sync.
**Shape**:
```
disinto upgrade v0.3.0
```
Sequence (roughly):
1. `git fetch --tags` and verify the tag exists
2. Bail if the working tree is dirty
3. `git checkout v0.3.0`
4. `_env_set_idempotent AGENTS_IMAGE v0.3.0 .env` (helper from #641)
5. Re-run `_generate_compose_impl` (picks up the new image tag)
6. Run pre-upgrade migration hooks (Follow-up C)
7. `docker compose pull && docker compose up -d --force-recreate`
8. Run post-upgrade migration hooks
9. Health check; rollback to previous version on failure
10. Log result
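The sequence above can be sketched as a shell function. The helper names (`_env_set_idempotent`, `_generate_compose_impl`) are the ones the text proposes and should be treated as assumptions; migration hooks, health check, and rollback (steps 6, 8, 9) are elided for brevity:

```shell
# Sketch of a disinto_upgrade() subcommand, assuming the helpers named above.
disinto_upgrade() {
  local version="$1"
  git fetch --tags
  git rev-parse -q --verify "refs/tags/${version}" >/dev/null \
    || { echo "unknown tag: ${version}" >&2; return 1; }
  [ -z "$(git status --porcelain)" ] \
    || { echo "working tree dirty; refusing to upgrade" >&2; return 1; }
  git checkout "$version"
  _env_set_idempotent AGENTS_IMAGE "$version" .env   # helper from #641
  _generate_compose_impl                             # picks up the new image tag
  # (pre-upgrade migration hooks would run here)
  docker compose pull
  docker compose up -d --force-recreate
  # (post-upgrade hooks, health check, rollback-on-failure would follow)
}
```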
**Files touched**: `bin/disinto` (~150 lines, new `disinto_upgrade()` function), possibly
extracted to a new `lib/upgrade.sh` if it grows large enough to warrant separation.
**Dependency**: this sprint (needs `AGENTS_IMAGE` to be a real thing in `.env` and in the
compose generator).
### Follow-up B: unify `DISINTO_VERSION` and `AGENTS_IMAGE`
**Why**: today there are two version concepts in the codebase:
- `DISINTO_VERSION` — used at `docker/edge/entrypoint-edge.sh:84` for the in-container
source clone (`git clone --branch ${DISINTO_VERSION:-main}`). Defaults to `main`. Also
set in the compose generator at `lib/generators.sh:397` for the edge service.
- `AGENTS_IMAGE` — proposed by this sprint for the docker image tag in compose.
These should be **the same value**. If you are running the `v0.3.0` agents image, the
in-container source (if any clone still happens) should also be at `v0.3.0`. Otherwise
you get a v0.3.0 binary running against v-something-else source, which is exactly the
silent drift versioning is meant to prevent.
After this sprint folds source into the image, `DISINTO_VERSION` in containers becomes
vestigial. The follow-up: pick one name (probably keep `DISINTO_VERSION` since it is
referenced in more places), have `_generate_compose_impl` set both `image:` and the env
var from the same source, and delete the redundant runtime clone block in
`entrypoint-edge.sh`.
**Files touched**: `lib/generators.sh`, `docker/edge/entrypoint-edge.sh` (delete the
runtime clone block once the image carries source), possibly `lib/env.sh` for the
default value.
**Dependency**: this sprint.
### Follow-up C: migration framework for breaking changes
**Why**: some upgrades have side effects beyond "new code in the container":
- The CLAUDE_CONFIG_DIR migration (#641, `setup_claude_config_dir` in
`lib/claude-config.sh`) needs a one-time `mkdir + mv + symlink` per host.
- The credential-helper cleanup (#669; #671 for the safety-net repair) needs in-volume
URL repair.
- Future: schema changes in the vault, ops repo restructures, env var renames.
There is no `disinto/migrations/v0.3.0.sh` style framework. Existing migrations live
ad-hoc inside `disinto init` and run unconditionally on init. That works for fresh
installs but not for "I'm upgrading from v0.2.0 to v0.3.0 and need migrations
v0.2.1 → v0.2.2 → v0.3.0 to run in order".
**Shape**: a `migrations/` directory with one file per version (`v0.3.0.sh`,
`v0.3.1.sh`, …). `disinto upgrade` (Follow-up A) invokes each migration file in order
between the previous applied version and the target. Track the applied version in
`.env` (e.g. `DISINTO_LAST_MIGRATION=v0.3.0`) or in `state/`. Standard
rails/django/flyway pattern. The framework itself is small; the value is in having a
place for migrations to live so they are not scattered through `disinto init` and lost
in code review.
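A minimal sketch of that runner, under the stated assumptions (one `vN.sh` file per version, a last-applied tracking value, version-sorted ordering); none of these names exist yet:

```shell
# Hypothetical migrations/ runner: apply every migration strictly after the
# last applied version and up to (and including) the target, in version order.
run_migrations() {
  local last="$1" target="$2" dir="${3:-migrations}" f v
  for f in $(ls "$dir"/v*.sh 2>/dev/null | sort -V); do
    v=$(basename "$f" .sh)
    # Already applied (v <= last)? Skip it.
    [ "$(printf '%s\n%s\n' "$v" "$last" | sort -V | head -1)" = "$v" ] && continue
    # Beyond the target (v > target)? Stop.
    [ "$(printf '%s\n%s\n' "$target" "$v" | sort -V | head -1)" = "$target" ] \
      && [ "$v" != "$target" ] && break
    sh "$f" || return 1   # a failed migration halts the upgrade
  done
}
```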
**Files touched**: `lib/upgrade.sh` (the upgrade command is the natural caller), new
`migrations/` directory, a tracking key in `.env` for the last applied migration
version.
**Dependency**: Follow-up A (the upgrade command is the natural caller).
### Follow-up D: bootstrap-from-broken-state runbook
**Why**: this sprint and Follow-ups A–C describe the steady-state upgrade flow. But
existing client boxes — harb-dev-box specifically — are not in steady state. harb's
working tree is at tag `v0.2.0` (months behind main). Its containers are running locally
built `:latest` images of unknown vintage. Some host-level state (`CLAUDE_CONFIG_DIR`,
`~/.git/config` credential helper from the disinto-dev-box rollout) has not been applied
on harb yet. The clean upgrade flow cannot reach harb from where it currently is — there
is too much drift.
Each existing client box needs a **one-time manual reset** to a known-good baseline
before the versioned upgrade flow takes over. The reset is mechanical but not
automatable — it touches host-level state that pre-dates the new flow.
**Shape**: a documented runbook at `docs/client-box-bootstrap.md` (or similar) that
walks operators through the one-time reset:
1. `disinto down`
2. `git fetch --all && git checkout <latest tag>` on the working tree
3. Apply host-level migrations:
- `setup_claude_config_dir true` (from `lib/claude-config.sh`, added in #641)
- Strip embedded creds from `.git/config`'s forgejo remote and add the inline
credential helper using the pattern from #669
- Rotate `FORGE_PASS` and `FORGE_TOKEN` if they have leaked (separate decision)
4. Rebuild images (`docker compose build`) or pull from registry once this sprint lands
5. `disinto up`
6. Verify with `disinto status` and a smoke fetch through the credential helper
After the reset, the box is in a known-good baseline and `disinto upgrade <version>`
takes over for all subsequent upgrades. The runbook documents this as the only manual
operation an operator should ever have to perform on a client box.
**Files touched**: new `docs/client-box-bootstrap.md`. Optionally a small change to
`disinto init` to detect "this looks like a stale-state box that needs the reset
runbook, not a fresh init" and refuse with a pointer to the runbook.
**Dependency**: none (can be done in parallel with this sprint and the others).
## Updated recommendation
The original recommendation stands: this sprint is worth it, ~80% gluecode, GHCR over
Docker Hub. Layered on top:
- **Sequence the four follow-ups**: A (upgrade subcommand) and D (bootstrap runbook) are
independent of this sprint's image work and can land in parallel. B (version
unification) is small cleanup that depends on this sprint. C (migration framework) can
wait until the first migration that actually needs it — `setup_claude_config_dir`
doesn't, since it already lives in `disinto init`.
- **Do not fix #665 in parallel**: as noted in "Side effects", this sprint deletes the
cause. A `depends_on: service_healthy` workaround applied to edge in parallel would be
wasted work.
- **Do not file separate forge issues for the follow-ups until this sprint is broken into
sub-issues**: keep them in this document until the architect (or the operator) is ready
to commit to a sequence. That avoids backlog clutter and lets the four follow-ups stay
reorderable as the sprint shape evolves.

---
# Sprint: website observability wire-up
## Vision issues
- #426 — Website observability — make disinto.ai an observable addressable
## What this enables
After this sprint, the factory can read engagement data from disinto.ai. The planner
will have daily evidence files in `evidence/engagement/` to answer: how many people
visited, where they came from, which pages they viewed. Observables will exist.
The prerequisites for two milestones unlock:
- Adoption: "Landing page communicating value proposition" (evidence confirms it works)
- Ship (Fold 2): "Engagement measurement baked into deploy pipelines" (verify-observable step becomes non-advisory)
## What exists today
The design and most of the code are already done:
- `site/collect-engagement.sh` — Complete. Parses Caddy's JSON access log, computes unique visitors / page views / top referrers, writes dated JSON evidence to `$OPS_REPO_ROOT/evidence/engagement/YYYY-MM-DD.json`.
- `formulas/run-publish-site.toml` verify-observable step — Complete. Checks Caddy log activity, script presence, and evidence recency on every deploy.
- `docs/EVIDENCE-ARCHITECTURE.md` — Documents the full pipeline: Caddy logs → collect-engagement → evidence/engagement/
- `docs/OBSERVABLE-DEPLOY.md` — Documents the observable deploy pattern.
- `docker/edge/Dockerfile` — Caddy edge container exists for the factory.
What's missing is the wiring: connecting the factory to the remote Caddy host where
disinto.ai runs.
## Complexity
Files touched: 4-6 depending on fork choices
Subsystems: vault dispatch, SSH access, log collection, ops repo evidence
Sub-issues: 3-4
Gluecode ratio: ~80% gluecode, ~20% greenfield (the container/formula is new)
## Risks
- Production Caddy is on a separate host from the factory — all collection must go over SSH.
- Log format mismatch: collect-engagement.sh assumes Caddy's structured JSON format. If the production Caddy uses default Combined Log Format, the script will produce empty reports silently.
- SSH key scope: the key used for collection should be purpose-limited to avoid granting broad access.
- Evidence commit: the container must commit evidence to the ops repo via Forgejo API (not git push over SSH) to keep the secret surface minimal.
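The Forgejo-API commit path can be sketched with the standard Gitea/Forgejo contents endpoint (`POST /api/v1/repos/{owner}/{repo}/contents/{filepath}`); the host, repo path, and token variable here are illustrative assumptions:

```shell
# Hypothetical evidence commit via the Forgejo contents API, avoiding git push
# over SSH entirely (only an API token is needed inside the container).
commit_evidence() {
  local file="$1"
  local path="evidence/engagement/$(basename "$file")"
  local payload
  payload=$(printf '{"content":"%s","message":"evidence: daily engagement"}' \
    "$(base64 -w0 < "$file")")
  curl -sS -X POST \
    -H "Authorization: token ${FORGE_TOKEN}" \
    -H "Content-Type: application/json" \
    -d "$payload" \
    "http://forgejo:3000/api/v1/repos/ops/disinto-ops/contents/${path}"
}
```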
## Cost — new infra to maintain
- One vault action formula (`formulas/collect-engagement.toml` or extension of existing formula)
- One SSH key on the Caddy host's authorized_keys
- Daily evidence files in ops repo (evidence/engagement/*.json) — ~1KB/file
- No new long-running services or agents
## Recommendation
Worth it. The human-directed architecture (dispatchable container with SSH) is
cleaner than running cron directly on the production host — it keeps all factory
logic inside the factory and treats the Caddy host as a dumb data source.
## Design forks
### Q1: What does the container fetch from the Caddy host?
*Context: `collect-engagement.sh` already parses Caddy JSON access logs into evidence JSON. The question is where that parsing happens.*
- **(A) Fetch raw log, process locally**: Container SSHs in, copies today's access log segment (e.g. `rsync` or `scp`), then runs `collect-engagement.sh` inside the container against the local copy. The Caddy host needs zero disinto code installed.
- **(B) Run script remotely**: Container SSHs in and executes `collect-engagement.sh` on the Caddy host. Requires the script (or a minimal version) to be deployed on the host. Output piped back.
- **(C) Pull Caddy metrics API**: Container opens an SSH tunnel to Caddy's admin API (port 2019) and pulls request metrics directly. No log file parsing — but Caddy's metrics endpoint is less rich than full access log analysis (no referrers, no per-page breakdown).
*Architect recommends (A): keeps the Caddy host dumb, all logic in the factory container, and `collect-engagement.sh` runs unchanged.*
### Q2: How is the daily collection triggered?
*Context: Other factory agents (supervisor, planner, gardener) run on direct cron via `*-run.sh`. Vault actions go through the PR approval workflow. The collection is a recurring low-risk read-only operation.*
- **(A) Direct cron in edge container**: Add a cron entry to the edge container entrypoint, like supervisor/planner. Simple, no vault overhead. Runs daily without approval.
- **(B) Vault action with auto-dispatch**: Create a recurring vault action TOML. If PR #12 (blast-radius tiers) lands, low-tier actions auto-execute. If not, each run needs admin approval — too heavy for daily collection.
- **(C) Supervisor-triggered**: Supervisor detects stale evidence (no `evidence/engagement/` file for today) and dispatches collection. Reactive rather than scheduled.
*Architect recommends (A): this is a read-only data collection, same risk profile as supervisor health checks. Vault gating a daily log fetch adds friction without security benefit.*
### Q3: How is the SSH key provisioned for the collection container?
*Context: The vault dispatcher supports `mounts: ["ssh"]` which mounts `~/.ssh` read-only into the container. The edge container already has SSH infrastructure for reverse tunnels (`disinto-tunnel` user, `autossh`).*
- **(A) Factory operator's SSH keys** (`mounts: ["ssh"]`): Reuse the existing SSH keys on the factory host. Simple, but grants the container access to all hosts the operator can reach.
- **(B) Dedicated purpose-limited key**: Generate a new SSH keypair, install the public key on the Caddy host with `command=` restriction (only allows `cat /var/log/caddy/access.log` or similar). Private key stored as `CADDY_SSH_KEY` in `.env.vault.enc`. Minimal blast radius.
- **(C) Edge tunnel reverse path**: Instead of the factory SSHing *out* to Caddy, have the Caddy host push logs *in* via the existing reverse tunnel infrastructure. Inverts the connection direction but requires a log-push agent on the Caddy host.
*Architect recommends (B): purpose-limited key with `command=` restriction on the Caddy host gives least-privilege access. The factory never gets a shell on production.*
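Option (B) can be sketched as follows; the key path, comment, and the exact log path in the forced command are assumptions, and the `authorized_keys` line is shown as the output to install on the Caddy host:

```shell
# On the factory: generate a dedicated keypair for log collection only.
keydir=$(mktemp -d)
ssh-keygen -q -t ed25519 -N '' -f "$keydir/caddy-collect" -C 'disinto-collect'

# Line to append to ~/.ssh/authorized_keys on the Caddy host: the forced
# command means this key can only read the access log, never get a shell.
printf 'command="cat /var/log/caddy/access.log",no-pty,no-port-forwarding,no-agent-forwarding %s\n' \
  "$(cat "$keydir/caddy-collect.pub")"
```

The private key would then be stored as `CADDY_SSH_KEY` in `.env.vault.enc`, as the fork describes.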
## Proposed sub-issues
### If Q1=A, Q2=A, Q3=B (recommended path):
1. **`collect-engagement` formula + container script**: Create `formulas/collect-engagement.toml` with steps: SSH into Caddy host using dedicated key → fetch today's access log segment → run `collect-engagement.sh` on local copy → commit evidence JSON to ops repo via Forgejo API. Add cron entry to edge container.
2. **Format-detection guard in `collect-engagement.sh`**: Add a check at script start that verifies the input file is Caddy JSON format (not Combined Log Format). Fail loudly with actionable error if format is wrong.
3. **`evidence/engagement/` directory + ops-setup wiring**: Ensure `lib/ops-setup.sh` creates the evidence directory. Register the engagement cron schedule in factory setup docs.
4. **Document Caddy host SSH setup**: Rent-a-human instructions for: generate keypair, install public key with `command=` restriction on Caddy host, add private key to `.env.vault.enc`.
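The format-detection guard of sub-issue 2 can be sketched as a cheap first-line check: Caddy's structured JSON log lines are objects with a `ts` field, while Combined Log Format lines start with a client address. The function name is illustrative:

```shell
# Hypothetical guard for collect-engagement.sh: refuse non-JSON input loudly
# instead of silently producing empty evidence reports.
assert_caddy_json() {
  local first
  first=$(head -n1 "$1")
  case "$first" in
    '{'*'"ts"'*) return 0 ;;
    *)
      echo "error: $1 does not look like Caddy structured JSON; refusing to emit empty evidence" >&2
      return 1 ;;
  esac
}
```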
### If Q1=B (remote execution):
Sub-issues 2-4 remain the same. Sub-issue 1 changes: container SSHs in and runs the script remotely, requiring script deployment on the Caddy host (additional manual step).
### If Q2=B (vault-gated):
Sub-issue 1 changes: instead of cron, create a vault action TOML template and document the daily dispatch. Depends on PR #12 (blast-radius tiers) for auto-approval.
### If Q3=A (operator SSH keys):
Sub-issue 4 is simplified (no dedicated key generation), but blast radius is wider.
### If Q3=C (reverse tunnel):
Sub-issue 1 changes significantly: instead of SSH-out, configure a log-push cron on the Caddy host that sends logs through the reverse tunnel. More infrastructure on the Caddy host side.