Compare commits
5 commits
main...vault/fix-
| Author | SHA1 | Date |
|---|---|---|
| | dcc9649dbd | |
| | 1a39a3ed80 | |
| | 74dc64d134 | |
| | 7f9d5224ba | |
| | f6702cea97 | |
27 changed files with 275 additions and 892 deletions
5 RESOURCES.md Normal file
@@ -0,0 +1,5 @@
# RESOURCES

## Overview

<!-- Add content here -->
0 evidence/engagement/.gitkeep Normal file
0 evidence/evolution/.gitkeep Normal file
0 evidence/holdout/.gitkeep Normal file
0 evidence/red-team/.gitkeep Normal file
0 evidence/user-test/.gitkeep Normal file
0 knowledge/.gitkeep Normal file
5 portfolio.md Normal file
@@ -0,0 +1,5 @@
# Portfolio

## Overview

<!-- Add content here -->
@@ -1,5 +1,5 @@
 # Prerequisite Tree
-<!-- Last updated: 2026-04-08 -->
+<!-- Last updated: 2026-04-15 -->
 
 ## Objective: Foundation — Core agent loop (dev → CI → review → merge)
 - [x] dev-agent picks up backlog issues (dev/dev-agent.sh exists)
@@ -8,6 +8,9 @@
 - [x] Stale in-progress recovery (#224 — closed)
 - [x] Agent race condition fix (#160 — closed)
 - [x] Dispatcher grep Alpine fix (#150 — closed)
+- [x] Dev-poll post-crash deadlock (#749 — closed)
+- [x] Entrypoint wait deadlock (#753 — closed)
+- [x] Credential helper race on cold boot (#741 — closed)
 Status: DONE
 
 ## Objective: Foundation — Supervisor health monitoring
@@ -18,7 +21,7 @@ Status: DONE
 ## Objective: Foundation — Planner gap analysis against vision
 - [x] Planner formula exists (run-planner.toml v4)
 - [x] planner-run.sh cron wrapper exists
-- [x] Planning runs established and maintaining prerequisite tree (run 1: 2026-04-05, run 2: 2026-04-08)
+- [x] Planning runs established (run 1: 2026-04-05, run 2: 2026-04-08, run 3: 2026-04-15)
 Status: DONE
 
 ## Objective: Foundation — Multi-project support
@@ -29,7 +32,7 @@ Status: DONE
 ## Objective: Foundation — Knowledge graph for structural defect detection
 - [x] networkx package installed in agents container (#220 — closed)
 - [x] build-graph.py exists in lib/
-- [x] Graph report generating successfully (165 nodes, 137 edges as of 2026-04-08)
+- [x] Graph report generating successfully (217 nodes, 317 edges as of 2026-04-15)
 Status: DONE
 
 ## Objective: Foundation — Predictor-planner adversarial feedback loop
@@ -45,24 +48,59 @@ Status: DONE
 - [x] disinto init re-run stability (#158 — closed)
 - [x] disinto init repo creation API endpoint (#164 — closed)
 - [x] Prediction labels created during init (#225 — closed)
-- [ ] Ops repo migration for existing deployments (#425 — backlog+priority)
-Status: BLOCKED — #425 ops repo missing dirs on existing deployments
+- [x] Ops repo migration for existing deployments (#425 — closed, #688 — closed)
+- [x] Edge service restart policy (#768 — closed)
+- [ ] Ops repo branch protection blocks agent writes (#758 — blocked, bug-report) blocked-on-vault (vault/pending/disinto-ops-branch-protection.md)
+- [ ] Planner PR-based ops flow (#765 — blocked, engineering fix for #758)
+- [ ] agents-llama as first-class generator service (#769 — backlog)
+- [ ] disinto up should regenerate compose/Caddyfile from generators.sh (#770 — backlog, depends on #769)
+- [ ] Deprecate tracked docker/Caddyfile (#771 — backlog)
+- [ ] disinto down && disinto up reproducibility (#772 — blocked, depends on #769+#770+#771)
+Status: BLOCKED — #758 ops repo branch protection (human action needed); #769-#771 in backlog for bootstrap reproducibility
 
 ## Objective: Adoption — Built-in Forgejo + Woodpecker CI
 - [x] Docker compose with Forgejo + Woodpecker
 - [x] Woodpecker OAuth2 redirect URI fix (#172 — closed)
 - [x] WOODPECKER_HOST override fix (#178 — closed)
+- [x] CI exhaustion root cause fixed (#742 — closed)
 Status: DONE
 
 ## Objective: Adoption — Landing page communicating value proposition
 - [x] Website addressable exists (disinto.ai)
-- [ ] Website observability — no engagement measurement (#426 — vision)
-Status: BLOCKED — no evidence process connected to website
+- [x] Evidence/engagement directory setup (#747 — closed)
+- [x] Format-detection guard in collect-engagement.sh (#746 — closed)
+- [x] Collect-engagement formula + container script (#745 — closed, PR #761)
+- [ ] Website observability — engagement measurement wired (#426 — vision)
+Status: BLOCKED — #426 needs design decisions (vision-level), engagement collection infrastructure ready
 
 ## Objective: Adoption — Example project demonstrating full lifecycle
-- [ ] No example project exists
-- [ ] Requires verified bootstrap (#425)
-Status: BLOCKED — depends on bootstrap completion and ops repo migration
+- [x] Bootstrap path verified (#425, #688 — closed)
+- [ ] Example project design and implementation (#697 — vision+priority)
+Status: BLOCKED — #697 needs design (vision-level), bootstrap path verified
+
+## Objective: Adoption — Subpath routing + Forgejo-OAuth-gated Claude chat (#623)
+- [x] Caddy subpath routing skeleton (#704 — closed)
+- [x] Chat container scaffold (#705 — closed)
+- [x] Chat sandbox hardening (#706 — closed)
+- [x] Claude identity isolation (#707 — closed)
+- [x] Forgejo OAuth gate (#708 — closed)
+- [x] Caddy Remote-User forwarding (#709 — closed)
+- [x] Conversation history persistence (#710 — closed)
+- [x] Cost caps + rate limiting (#711 — closed)
+- [x] Escalation tools (#712 — closed)
+- [x] Per-project subdomain fallback (#713 — closed)
+Status: DONE — all 10 sub-issues closed, parent #623 awaiting architect close
+
+## Objective: Adoption — Architect agent reliability
+- [x] Architect FORGE_TOKEN override bug (#762 — closed 2026-04-15)
+- [x] Architect pitch prompt guardrail bypass (#764 — closed 2026-04-15)
+Status: DONE
+
+## Objective: Adoption — Versioned agent images (#429)
+- [ ] Publish versioned agent images — compose should use image: not build: (#429 — in-progress, vision)
+Status: IN PROGRESS — #429 being worked on
+
+## --- ADOPTION MILESTONE: IN PROGRESS ---
 
 ## Objective: Ship (Fold 2) — Deploy profiles per artifact type
 - [ ] No deploy profiles defined
@@ -72,8 +110,10 @@ Status: BLOCKED — not started, needs design (vision-level)
 ## Objective: Ship (Fold 2) — Vault-gated fold transitions
 - [x] Vault redesign complete (#73-#77 — all closed)
 - [x] Vault PR workflow documented (docs/VAULT.md)
-- [ ] Vault directories complete in ops repo (#425 — approved/fired/rejected missing)
-Status: BLOCKED — #425 ops repo dirs needed for vault workflow
+- [x] Vault directories seeded in ops repo (#425, #688 — closed)
+- [ ] Ops repo branch protection blocks vault item visibility (#758) blocked-on-vault (vault/pending/disinto-ops-branch-protection.md)
+- [ ] vault_request RETURN trap fires prematurely (#773 — backlog, bug-report)
+Status: BLOCKED — #758 prevents vault items from reaching remote; #773 vault bug in backlog
 
 ## Objective: Ship (Fold 2) — Engagement measurement baked into deploy pipelines
 - [ ] No engagement measurement exists
@@ -82,6 +122,7 @@ Status: BLOCKED — depends on deploy profiles + website observability (#426)
 
 ## Objective: Ship (Fold 2) — Rent-a-human for gated channels
 - [x] run-rent-a-human formula exists
+- [x] Caddy SSH key setup documented (#748 — closed)
 - [ ] Not yet exercised in production
 Status: READY
 
0 sprints/.gitkeep Normal file
@@ -1,52 +0,0 @@
# Sprint: agent management redesign

## Vision issues
- #557 — redesign agent management — hire by inference backend, list by capability

## What this enables

After this sprint, operators can:
1. Hire agents by backend (disinto hire anthropic, disinto hire llama --url ...) instead of inventing names and roles
2. List all agents (disinto agents list) with backend, model, roles, and status in one table
3. Discover what is running without grepping compose files, TOML configs, and state directories

The factory becomes self-describing: an operator who inherits a running instance can immediately see what agents exist, what backends they use, and what roles they fill.

## What exists today

The agent management system is functional but fragmented:

- disinto hire-an-agent name role (lib/hire-agent.sh): Creates Forgejo user, .profile repo, API token, state file, and optionally writes agents TOML section plus regenerates compose. Works, but the mental model is backwards — the operator must invent a name and pick a role before specifying the backend.
- disinto agent enable/disable/status (bin/disinto): Manages state files for 6 hardcoded core agents (dev, reviewer, gardener, architect, planner, predictor). Local-model agents are invisible to this command.
- agents TOML sections (projects/*.toml): Store local-model agent config (base_url, model, roles, forge_user). Read by lib/generators.sh to generate per-agent docker-compose services.
- AGENT_ROLES env var: Runtime gate in entrypoint.sh — comma-separated list of roles the container runs.
- Compose profiles: Local-model agents gated by profiles, requiring explicit --profile to start.
State lives in three disconnected places: state files (CLI), env vars (runtime), compose services (docker). No single command unifies them.
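
For illustration, a local-model agents section of roughly this shape is what the generator reads (key names are taken from the bullet above; the table name and exact layout in projects/*.toml are assumptions):

```toml
# Illustrative sketch only: key names from the sprint notes (base_url, model,
# roles, forge_user); the section name and exact schema are assumptions.
[agents.llama-reviewer]
base_url = "http://agents-llama:8080/v1"  # inference backend endpoint
model = "llama-3.1-8b-instruct"
roles = ["reviewer"]                      # becomes AGENT_ROLES in the container
forge_user = "llama-reviewer-bot"
```

A backend-first `disinto hire llama --url ...` would write a section like this instead of asking the operator to assemble it by hand.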

## Complexity

- Files touched: ~4 (bin/disinto, lib/hire-agent.sh, lib/generators.sh, docker/agents/entrypoint.sh)
- Subsystems: CLI, compose generator, container entrypoint, project TOML schema
- Estimated sub-issues: 4-5
- Gluecode vs greenfield: ~80% gluecode (refactoring existing hire-agent.sh and CLI), ~20% greenfield (new agents list output, backend-first hire UX)

## Risks

- Breaking existing hire-an-agent: The old command must keep working during transition. Operators may have scripts that call it. Deprecation path needed.
- State migration: Existing local-model agents configured via agents TOML need to work unchanged. The new system reads the same TOML — no migration required if we keep the schema.
- Entrypoint.sh hardcoded list: The 6 core agents are hardcoded in multiple places (entrypoint.sh, bin/disinto). Making this dynamic requires careful testing to avoid breaking the polling loop.
- TOML parsing fragility: The hire-agent.sh TOML writer uses a Python inline script. Changes to the TOML schema could break parsing if not tested.

## Cost — new infra to maintain

- No new services, cron jobs, or formulas. This is a refactor of existing CLI and configuration paths.
- New code: disinto hire subcommand (~100 lines), disinto agents list subcommand (~80 lines), agent registry logic that unifies the three state sources (~50 lines).
- Removed code: Portions of the current hire-an-agent that duplicate backend detection logic.
- Ongoing: The hardcoded agent list in bin/disinto and entrypoint.sh becomes a derived list (from state files + TOML + compose). Slightly more complex discovery logic, but eliminates the need to update hardcoded lists when new agent types are added.

## Recommendation

Worth it. This is a high-value, low-risk refactor that directly improves the adoption story. The current UX is the number one friction point for new operators — hire-an-agent requires knowing three things (name, role, backend) in the wrong order. The redesign makes the common case (disinto hire anthropic) a one-liner and gives operators visibility into what is running. No new infrastructure, no new dependencies, mostly gluecode over existing interfaces.

Defer only if the team wants to stabilize the current agent set first (all 4 open architect sprints are pending human review). Otherwise, this is independent work that does not conflict with any in-flight sprint.
@@ -1,54 +0,0 @@
# Sprint pitch: bug-report pipeline — inbound classification + auto-close

## Vision issues
- #388 — end-to-end bug-report management — inbound classification, reproduction routing, and auto-close loop

## What this enables

After this sprint, bug-reports flow through a **cheap classification gate** before reaching the expensive reproduce-agent. Inspection-class bugs (stack trace cited, cause obvious from code) go straight to dev-agent — saving the full Playwright/MCP environment spin-up. The auto-close loop fires reliably, and upstream Codeberg reporters get notified when their bug is fixed.

Today: every bug-report → reproduce-agent (expensive). After: only ambiguous bugs → reproduce-agent; obvious bugs → dev-agent directly.

## What exists today

The pipeline is 80% built:

| Component | Status | Location |
|-----------|--------|----------|
| Gardener bug-report detection + enrichment | Complete | `formulas/run-gardener.toml:79-134` |
| Reproduce-agent (Playwright MCP, exit gates) | Complete | `formulas/reproduce.toml`, `docker/reproduce/` |
| Triage-agent (6-step root cause) | Complete | `formulas/triage.toml` |
| Dev-poll label gating (skips `bug-report`) | Complete | `dev/dev-poll.sh` |
| Auto-close decomposed parents | Complete (not firing) | `formulas/run-gardener.toml:224-269` |
| Issue templates (bug.yaml, feature.yaml) | Complete | `.forgejo/ISSUE_TEMPLATE/` |
| Manifest action system | Complete | `gardener/pending-actions.json` |

Reusable infrastructure: formula-session.sh, agent-sdk.sh, issue-lifecycle.sh label helpers, parse-deps.sh dependency extraction, manifest-driven mutation pattern.

## Complexity

- **5-6 sub-issues** estimated
- **~8 files touched** across formulas, lib, and gardener
- **Mostly gluecode** — extending existing gardener formula, adding a classification step, wiring auto-close reliability, adding upstream notification
- **One new formula step** (inbound classifier in run-gardener.toml or a dedicated pre-check)
- **No new containers or services** — classification runs inside existing gardener session
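
As a sketch of what a conservative inbound classifier might look like (the checks below are illustrative assumptions, not the formula's actual criteria):

```shell
# Hypothetical sketch of a cheap classification gate. A report skips the
# expensive reproduce-agent only when every text-level check passes;
# anything ambiguous keeps the current routing to reproduce-agent.
classify_bug_report() {
  body="$1"
  if printf '%s' "$body" | grep -qiE 'traceback|stack trace|panic:' &&
     printf '%s' "$body" | grep -qE '[A-Za-z0-9_/.-]+\.(sh|py|toml):[0-9]+' &&
     ! printf '%s' "$body" | grep -qiE 'intermittent|sometimes|randomly'; then
    echo "dev-agent"        # inspection-class: cited trace + file:line, deterministic
  else
    echo "reproduce-agent"  # ambiguous: needs the full reproduction environment
  fi
}
```

Keeping the pass condition conjunctive is the conservative choice: any single failed check falls back to the existing expensive path.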

## Risks

- **Classification accuracy** — the cheap pre-check might route ambiguous bugs to dev-agent, wasting dev cycles on bugs it can't fix without reproduction. Mitigation: conservative skip-reproduction criteria (all four pre-check questions must be clean).
- **Gardener formula complexity** — run-gardener.toml is already the most complex formula. Adding classification logic increases cognitive load. Mitigation: classification could be a separate formula step with clear entry/exit gates.
- **Upstream Codeberg notification** — requires Codeberg API token in `.env.vault.enc`. Currently in `.netrc` on host but not in containers. Needs vault action for the actual notification (AD-006 compliance).
- **Auto-close timing** — if gardener runs are infrequent (every 6h), the auto-close feedback loop is slow. Not a sprint problem per se, but worth noting.

## Cost — new infra to maintain

- **One new gardener formula step** (inbound classification) — maintained alongside existing grooming step
- **Bug taxonomy labels** (bohrbug, heisenbug, mandelbug, schrodinbug or simplified equivalents) — 2-4 new labels
- **No new services, cron jobs, or agent roles** — everything runs within existing gardener cycle
- **Codeberg notification vault action template** — one new TOML in `vault/examples/`

## Recommendation

**Worth it.** The infrastructure is 80% built. This sprint fills the two concrete gaps (classification gate + auto-close reliability) with minimal new maintenance burden. The biggest value is avoiding unnecessary reproduce-agent runs — each one costs a full Claude session with Playwright MCP for bugs that could be triaged by reading code. The auto-close fix is nearly free (the logic exists, just needs the gardener to run reliably). Upstream notification is a small vault action addition.

Defer the statistical reproduction mode (Heisenbug handling) and bulk deduplication to a follow-up sprint — they add complexity without proportional value at current bug volume.
@@ -1,106 +0,0 @@
# Sprint: edge-subpath-chat

## Vision issues
- #623 — vision: subpath routing + Forgejo-OAuth-gated Claude chat inside the edge container

## What this enables
After this sprint, an operator running `disinto edge register` gets a single URL — `<project>.disinto.ai` — with Forgejo at `/forge/`, Woodpecker CI at `/ci/`, a staging preview at `/staging/`, and an OAuth-gated Claude Code chat at `/chat/`, all under one wildcard cert and one bootstrap password. The factory talks back to its operator through a chat window that sits next to the forge, CI, and live preview it is driving.

## What exists today
The majority of this vision is already implemented across issues #704–#711:

- **Subpath routing**: Caddyfile generator produces `/forge/*`, `/ci/*`, `/staging/*`, `/chat/*` handlers (`lib/generators.sh:780–822`). Forgejo `ROOT_URL` and Woodpecker `WOODPECKER_HOST` are set to subpath values when `EDGE_TUNNEL_FQDN` is present (`bin/disinto:842–847`).
- **Chat container**: Full OAuth flow via Forgejo, HttpOnly session cookies, forward_auth defense-in-depth with `FORWARD_AUTH_SECRET`, per-user rate limiting (hourly/daily/token caps), conversation history in NDJSON (`docker/chat/server.py`).
- **Sandbox hardening**: Read-only rootfs, `cap_drop: ALL`, `no-new-privileges`, `pids_limit: 128`, `mem_limit: 512m`, no Docker socket. Verification script at `tools/edge-control/verify-chat-sandbox.sh`.
- **Edge control plane**: Tunnel registration, port allocation, Caddy admin API routing, wildcard `*.disinto.ai` cert via DNS-01 (`tools/edge-control/`).
- **Dependencies #620/#621/#622**: Admin password prompt, edge control plane, and reverse tunnel — all implemented and merged.
- **Subdomain fallback plan**: Fully documented at `docs/edge-routing-fallback.md` with pivot criteria.
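
The routing described above might look like the following minimal Caddyfile sketch (upstream container names and ports are invented for illustration; the real Caddyfile is emitted by lib/generators.sh):

```caddyfile
# Illustrative sketch, not the generated config: upstream names/ports are assumptions.
project.disinto.ai {
	handle /forge/* {
		reverse_proxy forgejo:3000
	}
	handle /ci/* {
		reverse_proxy woodpecker:8000
	}
	handle /chat/* {
		# Defense-in-depth: Caddy verifies the session before proxying
		forward_auth chat:8081 {
			uri /auth/verify
			copy_headers Remote-User
		}
		reverse_proxy chat:8081
	}
	handle /staging/* {
		reverse_proxy staging:8080
	}
	redir / /forge/ 302
}
```

Subpath mode keeps everything under one host block; the documented subdomain fallback would replace this with four host blocks.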

## Complexity
- ~6 files touched across 3 subsystems (Caddy routing, chat backend, compose generation)
- Estimated 4 sub-issues
- ~90% gluecode (wiring existing pieces), ~10% greenfield (WebSocket streaming, end-to-end smoke test)

## Risks
- **Forgejo/Woodpecker subpath breakage**: Neither service is battle-tested under subpaths in this stack. Redirect loops, OAuth callback mismatches, or asset 404s are plausible. Mitigation: the fallback plan (`docs/edge-routing-fallback.md`) is already documented and estimated at under one day to pivot.
- **Cookie/CSRF collision**: Forgejo and chat share the same origin — cookie names or CSRF tokens could collide. Mitigation: chat uses a namespaced cookie (`disinto_chat_session`) and a separate OAuth app.
- **Streaming latency**: One-shot `claude --print` blocks until completion. Long responses leave the operator staring at a spinner. Not a correctness risk, but a UX risk that WebSocket streaming would fix.

## Cost — new infra to maintain
- **No new services** — all containers already exist in the compose stack
- **No new scheduled tasks or formulas** — chat is a passive request handler
- **One new smoke test** (CI) — end-to-end subpath routing verification
- **Ongoing**: monitoring Forgejo/Woodpecker upstream for subpath regressions on upgrades

## Recommendation
Worth it. The vision is ~80% implemented. The remaining work is integration hardening (confirming subpath routing works end-to-end with real Forgejo/Woodpecker) and one UX improvement (WebSocket streaming). The risk is low because a documented fallback to per-service subdomains exists. Ship this sprint to close the loop on the edge experience.

## Sub-issues

<!-- filer:begin -->
- id: subpath-routing-smoke-test
  title: "vision(#623): end-to-end subpath routing smoke test for Forgejo + Woodpecker + chat"
  labels: [backlog]
  depends_on: []
  body: |
    ## Goal
    Verify that Forgejo, Woodpecker, and chat all function correctly when served
    under /forge/, /ci/, and /chat/ subpaths on a single domain. Catch redirect
    loops, OAuth callback failures, and asset 404s before they hit production.
    ## Acceptance criteria
    - [ ] Forgejo login at /forge/ completes without redirect loops
    - [ ] Forgejo OAuth callback for Woodpecker succeeds under subpath
    - [ ] Woodpecker dashboard loads all assets at /ci/ (no 404s on JS/CSS)
    - [ ] Chat OAuth login flow works at /chat/login
    - [ ] Forward_auth on /chat/* rejects unauthenticated requests with 401
    - [ ] Staging content loads at /staging/
    - [ ] Root / redirects to /forge/
    - [ ] CI pipeline added to .woodpecker/ to run this test on edge-related changes

- id: websocket-streaming-chat
  title: "vision(#623): WebSocket streaming for chat UI to replace one-shot claude --print"
  labels: [backlog]
  depends_on: [subpath-routing-smoke-test]
  body: |
    ## Goal
    Replace the blocking one-shot claude --print invocation in the chat backend with
    a WebSocket connection that streams tokens to the UI as they arrive.
    ## Acceptance criteria
    - [ ] /chat/ws endpoint accepts WebSocket upgrade with valid session cookie
    - [ ] /chat/ws rejects upgrade if session cookie is missing or expired
    - [ ] Chat backend streams claude output over WebSocket as text frames
    - [ ] UI renders tokens incrementally as they arrive
    - [ ] Rate limiting still enforced on WebSocket messages
    - [ ] Caddy proxies WebSocket upgrade correctly through /chat/ws with forward_auth

- id: chat-working-dir-scoping
  title: "vision(#623): scope Claude chat working directory to project staging checkout"
  labels: [backlog]
  depends_on: [subpath-routing-smoke-test]
  body: |
    ## Goal
    Give the chat container Claude session read-write access to the project working
    tree so the operator can inspect, explain, or modify code — scoped to that tree
    only, with no access to factory internals, secrets, or Docker socket.
    ## Acceptance criteria
    - [ ] Chat container bind-mounts the project working tree as a named volume
    - [ ] Claude invocation in server.py sets cwd to the workspace directory
    - [ ] Claude permission mode is acceptEdits (not bypassPermissions)
    - [ ] verify-chat-sandbox.sh updated to assert workspace mount exists
    - [ ] Compose generator adds the workspace volume conditionally

- id: subpath-fallback-automation
  title: "vision(#623): automate subdomain fallback pivot if subpath routing fails"
  labels: [backlog]
  depends_on: [subpath-routing-smoke-test]
  body: |
    ## Goal
    If the smoke test reveals unfixable subpath issues, automate the pivot to
    per-service subdomains so the switch is a single config change.
    ## Acceptance criteria
    - [ ] generators.sh _generate_caddyfile_impl accepts EDGE_ROUTING_MODE env var
    - [ ] In subdomain mode, Caddyfile emits four host blocks per edge-routing-fallback.md
    - [ ] register.sh registers additional subdomain routes when EDGE_ROUTING_MODE=subdomain
    - [ ] OAuth redirect URIs in ci-setup.sh respect routing mode
    - [ ] .env template documents EDGE_ROUTING_MODE with a comment referencing the fallback doc
<!-- filer:end -->
@@ -1,62 +0,0 @@
# Sprint: example project — full lifecycle demo

## Vision issues
- #697 — vision: example project demonstrating the full disinto lifecycle

## What this enables

After this sprint, a new user can see disinto working end-to-end on a real project:
`disinto init` → seed issues appear → dev-agent picks one up → PR opens → CI runs →
review-agent approves → merge → repeat. The example repo serves as both proof-of-concept
and onboarding reference.

This unblocks:
- **Adoption — Example project demonstrating full lifecycle** (directly)
- **Adoption — Landing page** (indirectly — the example is the showcase artifact)
- **Contributors** (lower barrier — people can see how disinto works before trying it)

## What exists today

- `disinto init <url>` fully bootstraps a project: creates repos, ops repo, branch protection,
issue templates, VISION.md template, docker-compose stack, cron scheduling
- Dev-agent pipeline is proven: issue → branch → implement → PR → CI → review → merge
- Review-agent, gardener, supervisor all operational
- Project TOML templates exist (`projects/*.toml.example`)
- Issue template for bug reports exists; `disinto init` copies it to target repos

What's missing: an actual example project repo with seed content and seed issues that
demonstrate the loop.

## Complexity

Files touched: 3-5 in the disinto repo (documentation, possibly `disinto init` tweaks)
New artifacts: 1 example project repo with seed files, 3-5 seed issues
Subsystems: bootstrap, dev-agent, CI, review
Sub-issues: 3-4
Gluecode ratio: ~70% content/documentation, ~30% scripting

## Risks

- **Maintenance burden**: The example project must stay working as disinto evolves.
If `disinto init` changes, the example may break. Mitigation: keep the example
minimal so there's less surface to break.
- **CI environment**: The example needs a working Woodpecker pipeline. If the
example uses a language that needs a specific Docker image in CI, that's a dependency.
Mitigation: choose a language/stack with zero build dependencies.
- **Seed issue quality**: If seed issues are too vague, the dev-agent will refuse them
(`underspecified`). If too trivial, the demo doesn't impress. Need a sweet spot.
- **Scope creep**: "Full lifecycle" could mean bootstrap-to-deploy. For this sprint,
scope to bootstrap-to-merge. Deploy profiles are a separate milestone.
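
To make the sweet spot concrete, a seed issue for a dependency-free example might read like this (title and criteria are invented for illustration):

```markdown
## Seed issue (illustrative): add a /now page

**Context**: the example site is plain static HTML served by the staging container.

**Task**: add `now.html` with a short status paragraph and link to it from `index.html`.

**Acceptance criteria**
- [ ] `now.html` exists and contains a heading and one paragraph
- [ ] `index.html` links to `now.html`
- [ ] CI passes with no new dependencies
```

Specific enough that the dev-agent cannot call it underspecified, small enough to merge in one pass.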

## Cost — new infra to maintain

- One example project repo (hosted on Forgejo, mirrored to Codeberg/GitHub)
- Seed issues re-created on each fresh `disinto init` run (or documented as manual step)
- No new services, agents, or cron jobs

## Recommendation

Worth it. This is the single most impactful adoption artifact. A working example
answers "does this actually work?" in a way that documentation cannot. The example
should be dead simple — a static site or a shell-only project — so it works on any
machine without language-specific dependencies.
@@ -1,58 +0,0 @@
# Sprint: gatekeeper agent

## Vision issues
- #485 — gatekeeper agent: verify external signals before they enter the factory

## What this enables

The factory gains a trust boundary between external mirrors (Codeberg, GitHub) and the internal Forgejo issue tracker. Today, external bug reports and feature requests must be manually copied into internal Forgejo — there is no automated inbound path from mirrors. The gatekeeper closes this gap by:

1. Polling external mirrors for new issues
2. Verifying claims against internal ground truth (git history, agent logs, system state)
3. Creating sanitized internal issues only when claims are confirmed
4. Defending against prompt injection via issue body rewriting (evidence-based, never verbatim copy)

This directly supports the Growth goals: "attract developers", "contributors", "lower the barrier to entry" — external contributors can file issues on public mirrors and the factory processes them safely.
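
The rewrite-never-copy defense in step 4 could be sketched as a small helper (function name and evidence fields are assumptions; a real implementation would also run lib/secret-scan.sh over the result before posting):

```shell
# Hypothetical sketch of the evidence-based rewrite in step 4. The internal
# issue body is built ONLY from fields the gatekeeper verified itself; the
# external reporter's prose is referenced by URL, never inlined.
compose_internal_issue() {
  external_url="$1"; verified_commit="$2"; verified_symptom="$3"
  cat <<EOF
## Verified external report
Source: $external_url (external body intentionally not copied)

## Claims confirmed against internal ground truth
- Symptom: $verified_symptom
- Observed at commit: $verified_commit
EOF
}
```

Because the reporter's text never reaches the output, a crafted issue body has no channel into downstream agent prompts.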
|
||||
|
||||
## What exists today
|
||||
|
||||
- **Mirror push** (lib/mirrors.sh): outbound only — pushes primary branch + tags to configured mirrors after each merge. No inbound path.
|
||||
- **Bug-report pipeline** (docker/edge/dispatcher.sh): reproduce -> triage -> verify agents handle internal bug reports. The gatekeeper slots upstream of this pipeline.
|
||||
- **Secret scanning** (lib/secret-scan.sh): detects and redacts secrets in issue bodies before posting. Reusable for gatekeeper sanitization.
|
||||
- **Issue lifecycle** (lib/issue-lifecycle.sh): claim model, dependency checking, label filtering. Gatekeeper follows the same patterns.
|
||||
- **Per-agent identities** (lib/hire-agent.sh): Forgejo bot accounts with dedicated tokens. Gatekeeper gets its own identity.
- **Agent run pattern**: all slow agents follow the same *-run.sh + formula pattern (gardener-run.sh, planner-run.sh, etc.). Gatekeeper follows this exactly.
- **Vault dispatch** for external actions: PR-based approval workflow. External API tokens (GITHUB_TOKEN, CODEBERG_TOKEN) are vault-only per AD-006.

## Complexity

- **New files (~5):** gatekeeper/gatekeeper-run.sh, gatekeeper/AGENTS.md, formulas/run-gatekeeper.toml, state marker
- **Modified files (~3):** docker/agents/entrypoint.sh (polling registration), AGENTS.md (agent table), docker-compose env vars
- **Gluecode vs greenfield:** ~70% gluecode (agent scaffolding, polling, issue creation reuse existing patterns), ~30% greenfield (external API polling, claim verification logic, prompt injection defense)
- **Estimated sub-issues:** 5-7

## Risks

1. **Prompt injection via crafted issue bodies** — primary risk. External issue bodies are untrusted input that could manipulate agents if copied verbatim. Mitigation: gatekeeper rewrites issues based on verified evidence, never copies external content. The internal issue body is authored by the gatekeeper, not the reporter.
2. **AD-006 tension** — the gatekeeper needs READ access to external APIs (GitHub Issues API, Codeberg Issues API), which requires GITHUB_TOKEN and CODEBERG_TOKEN. These are vault-only tokens per AD-006. The token access pattern is a design fork (see below).
3. **False positives/negatives** — gatekeeper may reject legitimate reports (false positive) or admit crafted ones (false negative). Mitigated by the existing reproduce/triage pipeline downstream — the gatekeeper is a first filter, not the only filter.
4. **Rate limiting** — GitHub API has 5000 req/hour (authenticated). Codeberg (Gitea) is typically more permissive. Hourly polling with pagination is well within limits.
5. **Scope creep** — the gatekeeper could expand into a full external community management tool. Sprint should be scoped to: poll -> verify -> create internal issue. No auto-responses to external reporters in v1.
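The hourly poll from risk 4 can be sketched roughly as follows. This is a sketch, not existing gatekeeper code: the function names and `since`-cursor handling are assumptions; the GitHub endpoint, `since`, and `per_page` pagination are real REST API parameters.

```shell
# Hypothetical poll helpers — not existing disinto code.
gatekeeper_poll_url() {
  # Paginated "issues updated since the last poll" URL (GitHub REST API).
  local repo="$1" since="$2" page="${3:-1}"
  printf 'https://api.github.com/repos/%s/issues?since=%s&per_page=100&page=%s' \
    "$repo" "$since" "$page"
}

gatekeeper_poll() {
  # One poll cycle: walk pages until an empty array comes back.
  # Failure is graceful: a curl error just ends the cycle (retry next interval).
  local repo="$1" since="$2" page=1 body
  while body=$(curl -sf -H "Authorization: Bearer $GITHUB_TOKEN" \
      "$(gatekeeper_poll_url "$repo" "$since" "$page")"); do
    [ "$body" = "[]" ] && break
    printf '%s\n' "$body"   # downstream: verify evidence, rewrite, file internally
    page=$((page + 1))
  done
}
```

At 100 issues per page, even a busy hour stays far below the 5000 req/hour budget.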
## Cost — new infra to maintain

- **1 new agent** in the polling loop (gatekeeper-run.sh, ~6h interval like gardener/architect)
- **1 new Forgejo bot account** (gatekeeper-bot)
- **1 new formula** (run-gatekeeper.toml)
- **External API dependency** — relies on GitHub/Codeberg APIs being available. Failure is graceful (skip poll, retry next interval).
- **No new services, no new containers** — runs in the existing agents container.

## Recommendation

**Worth it.** This is the missing inbound path for the mirror infrastructure that already exists. The outbound side (mirror push) has been working since early in the project. The gatekeeper completes the loop: code goes out via mirrors, feedback comes back via gatekeeper. The implementation is mostly gluecode following established patterns, with the greenfield work concentrated in the verification logic and prompt injection defense — both well-scoped problems.

The main decision point is how to handle AD-006 (external token access). This is a design fork that needs human input before sub-issues can be filed.

@@ -1,62 +0,0 @@
# Sprint: process evolution lifecycle

## Vision issues
- #418 — Process evolution: observe-propose-shadow-promote lifecycle

## What this enables

After this sprint, the factory can **safely mutate its own processes**. Today, agents observe project state (stuck issues, stale evidence) but not process state (how long do reviews take? how often do dev sessions fail on the same pattern? which formula steps are bottlenecks?). Process changes are manual edits to formulas with no testing path.

After this sprint:
- Agents collect structured process metrics (review latency, failure rates, escalation frequency)
- The predictor can propose process changes as structured RFCs with evidence
- Process changes can shadow-run alongside the current process before promotion
- Humans gate the shadow-to-promote transition via the existing vault/PR approval pattern

The factory becomes self-improving in a controlled, reversible way.

## What exists today

Strong foundations — most of the lifecycle has analogues already:

| Stage | Existing analogue | Gap |
|-------|------------------|-----|
| **Observe** | Predictor scans project state; knowledge graph does structural analysis | No process-state metrics (latency, failure rates, throughput) |
| **Propose** | Prediction-triage workflow (predictor files, planner triages) | Predictions are about project state, not process changes; no RFC format |
| **Shadow** | Nothing | No infrastructure to run two processes in parallel and compare |
| **Promote** | Vault PR approval; sprint ACCEPT/REJECT | Not wired to process lifecycle |

Additional existing infrastructure:
- `.profile/lessons-learned.md` captures per-agent learning (abstract patterns)
- `ops/knowledge/planner-memory.md` persists planner observations across runs
- `docs/EVIDENCE-ARCHITECTURE.md` defines sense vs mutation processes
- Formulas (`formulas/*.toml`) define processes but have no versioning

## Complexity

- **Files touched**: ~6-8 (knowledge graph, predictor formula, planner formula, new process-metrics formula, evidence architecture docs, ops repo RFC directory)
- **Subsystems**: predictor, planner, knowledge graph, formula-session, evidence pipeline
- **Estimated sub-issues**: 6-8
- **Gluecode vs greenfield**: ~60% gluecode (extending prediction-triage, adding graph nodes, wiring evidence collection) / ~40% greenfield (process metrics collector, RFC format, shadow-run comparator)

## Risks

1. **Compute cost**: Shadow-running doubles resource usage during shadow periods. Needs a time-bound or cycle-bound cap.
2. **Wrong metrics**: Process metrics must be carefully chosen — optimizing for speed could sacrifice quality. The predictor's existing "evidence strength" checks provide a model.
3. **Scope creep**: "Process evolution" could expand endlessly. This sprint should deliver the pipeline (observe, propose, shadow, promote) for ONE process as proof-of-concept, not all processes at once.
4. **Over-engineering risk**: The factory has ~10 agents, not 1000 microservices. The mechanism should be proportional to the system's complexity. A lightweight RFC-in-ops-repo approach is better than a framework.

## Cost — new infra to maintain

- **Process metrics formula** (`formulas/collect-process-metrics.toml`): new formula, runs on predictor/planner schedule. Collects from git log, CI API, and issue timeline.
- **RFC directory** (`ops/process-rfcs/`): new directory in ops repo. Low maintenance — just markdown files.
- **Shadow-run comparator**: new step in formula-session.sh that can fork a formula step between current and candidate implementations. Needs cleanup logic for shadow artifacts.
- **No new services or containers** — this extends existing agent capabilities, doesn't add new ones.
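One metric the process-metrics formula could pull from git log alone is review latency. A minimal sketch, assuming merge commits carry the conventional "Merge pull request" message — the function name and output shape are hypothetical:

```shell
# Sketch of one process metric: median review latency in hours, taken as
# the gap between the merged branch tip and its merge commit.
review_latency_hours() {
  git log --merges --grep='Merge pull request' --format='%H %ct' -n "${1:-50}" |
  while read -r merge_sha merge_ts; do
    # ^2 = tip of the branch that was merged in
    branch_ts=$(git log -1 --format='%ct' "${merge_sha}^2" 2>/dev/null) || continue
    echo $(( (merge_ts - branch_ts) / 3600 ))
  done | sort -n | awk '{ a[NR] = $1 } END { if (NR) print a[int((NR + 1) / 2)] }'
}
```

The same pattern (log, reduce, journal one number) would extend to failure rates and escalation frequency from the issue timeline.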
## Recommendation

**Worth it — but scope tightly to one proof-of-concept process.**

The prediction-triage workflow already implements observe-to-propose. Extending it to include shadow-to-promote is a natural evolution, not a leap. The key risk is scope creep — this sprint should deliver the pipeline for ONE process mutation (e.g., "skip review for docs-only PRs" or "auto-close stale predictions after 7 days") and prove the lifecycle works end-to-end.

Defer: building a generic process evolution framework. The first sprint proves the pattern; generalization comes later if the pattern holds.

@@ -1,48 +0,0 @@
# Sprint: supervisor Docker storage telemetry

## Vision issues
- #545 — supervisor should detect Docker btrfs subvolume usage explicitly, not rely solely on `df -h /`

## What this enables

After this sprint, the supervisor knows *where* disk pressure comes from — not just that it exists. When disk hits 80%, the supervisor can tell "Docker images are bloated" from "build cache is huge" from "volumes are growing" from "btrfs metadata overhead", and take the right remediation action. Trend-aware journaling lets the supervisor detect patterns across runs ("this image rebuilt 12 times today, 8 GB dangling layers") and escalate proactively before P1 thresholds are crossed.

## What exists today

- **Supervisor runs every 20 min** in the edge container via `supervisor-run.sh` -> `preflight.sh` -> Claude formula
- **Edge container has Docker socket** (`/var/run/docker.sock` mount) and root access
- **docker-cli installed** (Alpine `apk add docker-cli`)
- **Disk check**: `df -h / | awk 'NR==2{print $5}'` — filesystem-level only, no Docker breakdown
- **P1 remediation**: `docker system prune -f`, then `-a -f` if still >80%
- **Preflight framework** is structured text with `## Section` headers — easy to extend
- **Journal writing** already appends per-run findings to daily markdown files
- **Vault filing** for unresolved issues already in place

All infrastructure needed for this sprint is already built. This is pure extension of existing capabilities.

## Complexity

- **3 files touched**: `supervisor/preflight.sh` (new section), `formulas/run-supervisor.toml` (enhanced P1 logic), possibly a small helper
- **3-4 sub-issues** estimated
- **Gluecode ratio: ~90%** — calling `docker system df`, parsing its output, adding conditional branches for storage driver
- **Greenfield: ~10%** — btrfs-specific detection logic (if btrfs driver detected)
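The parsing step could look like this awk sketch over the tabular `docker system df` output. The function name and "category=reclaimable" output shape are assumptions; note that the Build Cache row of real output omits the trailing percentage, which the sketch handles:

```shell
# Sketch: turn `docker system df` tabular output (on stdin) into
# "category=reclaimable" lines the new preflight section can journal.
docker_storage_breakdown() {
  awk 'NR > 1 {
    # Rows like "Images 11 4 6.6GB 5.3GB (79%)"; Build Cache has no "(79%)".
    if ($NF ~ /^\(.*%\)$/) { recl = $(NF-1); last = NF - 5 }
    else                   { recl = $NF;     last = NF - 4 }
    cat = $1
    for (i = 2; i <= last; i++) cat = cat " " $i
    print cat "=" recl
  }'
}
```

With this breakdown, the P1 logic can target the biggest reclaimable bucket instead of pruning blindly.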
## Risks

- **btrfs tools not in container image**: `btrfs filesystem df`, `compsize` etc. require `btrfs-progs` package. May need an `apk add` in the Dockerfile or graceful degradation when tools are absent.
- **`docker system df -v` can be slow**: On systems with hundreds of images, verbose output takes seconds. Preflight already runs in a time-bounded context, but worth watching.
- **btrfs CoW sharing makes size reporting unreliable**: Apparent size != exclusive size. The sprint should report both where available, but document the caveat clearly.
- **Not all deployments use btrfs**: The solution must work with overlay2 (the common case) and degrade gracefully — btrfs-specific telemetry is additive, not required.

## Cost — new infra to maintain

- **No new services, cron jobs, or containers** — extends existing supervisor preflight
- **No new formulas** — extends existing `run-supervisor.toml`
- **Minimal ongoing cost**: if Docker changes `docker system df` output format (unlikely), the parser needs updating. Otherwise maintenance-free.
- **One conditional branch** in remediation logic (storage-driver-aware cleanup) — adds ~20 lines to the formula

## Recommendation

**Worth it.** This sprint addresses a real production incident (harb-dev-box 98% disk with no visible culprit), the fix is low-risk gluecode extending well-tested infrastructure, and it adds zero ongoing maintenance burden. The information gap is the root cause of blind remediation — the supervisor currently prunes and hopes. With Docker storage telemetry, it can make informed decisions and escalate intelligently.

The btrfs-specific parts should degrade gracefully when tools are absent, making this useful on all storage drivers while providing extra resolution on btrfs deployments.

@@ -1,42 +0,0 @@
# Sprint: supervisor project-wide oversight

## Vision issues
- #540 — supervisor should have project-wide oversight, not just self-monitoring

## What this enables

After this sprint, the supervisor can:

1. Discover all Docker Compose stacks on the deployment box — not just the disinto factory
2. Attribute resource pressure to specific stacks — "harb-anvil-1 grew 12 GB" instead of "disk at 98%"
3. Surface cross-stack symptoms (restarting containers, unhealthy services, volume bloat) without per-project knowledge
4. Coordinate remediation through vault items naming the stack owner, rather than blindly pruning

This turns the supervisor from a single-project health monitor into a deployment-box health monitor — critical because factory deployments coexist with the projects they supervise.

## What exists today

- `preflight.sh` (227 lines) — already collects RAM, disk, load, docker ps, CI, PRs, issues, locks, phase files, worktrees, vault items. Easy to extend.
- `run-supervisor.toml` — priority framework (P0-P4) with auto-fix vs. vault-item escalation. New cross-stack rules slot into existing tiers.
- Edge container — has docker socket access, docker CLI installed. Can run `docker compose ls`, `docker stats`, `docker system df`.
- `projects/*.toml` — per-project config with `[services].containers` field. Could be extended for sibling stack ownership.
- AD-006 — external actions go through vault. Supervisor reports foreign stack symptoms but does not auto-remediate.
- `docker system prune -f` — already runs as P1 auto-fix. Currently affects all images symmetrically (the problem this sprint solves).

## Complexity

- Files touched: 3-4 (preflight.sh, run-supervisor.toml, projects/*.toml schema, new knowledge/sibling-stacks.md)
- Subsystems: supervisor only — no changes to other agents
- Estimated sub-issues: 5-6
- Gluecode vs greenfield: 80/20 (extending existing preflight sections and priority rules vs. stack ownership model)

## Risks

1. Docker socket blast radius — mitigated by read-only discovery commands; write actions stay vault-gated for foreign stacks.
2. `docker system prune` collateral — scoping prune to disinto-managed images requires label-based filtering (`com.disinto.managed=true`); factory images need labeling first.
3. Performance of `docker stats` — mitigated by `--no-stream --format` for a single snapshot.
4. Stack ownership ambiguity — no standard way to identify who owns a foreign compose project. Design fork needed.
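The scoping in risks 2 and 3 can be sketched with two helpers. The helper names are hypothetical and they emit the command strings rather than running them (so the formula can log first); the `--filter label=` and `--no-stream` flags are real Docker CLI options:

```shell
# Hypothetical helpers for the supervisor preflight.
scoped_prune_cmd() {
  # P1 prune restricted to factory-owned images only.
  printf 'docker system prune -f --filter label=%s\n' \
    "${1:-com.disinto.managed=true}"
}

stats_snapshot_cmd() {
  # One-shot, non-streaming resource snapshot across all stacks.
  printf 'docker stats --no-stream --format "{{.Name}} {{.MemUsage}}"\n'
}
```

The label filter only works once factory images actually carry the `com.disinto.managed=true` label, which is why labeling comes first in risk 2.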
## Cost — new infra to maintain

- No new services, cron jobs, or containers. Extends the existing supervisor.
- New knowledge file: knowledge/sibling-stacks.md (low maintenance).
- Optional TOML schema extension: [siblings] section in project config.
- Image labeling convention: com.disinto.managed=true on factory Dockerfiles and compose.

## Recommendation

Worth it. Addresses a real incident (harb-dev-box 98% disk), mostly gluecode extending proven patterns, adds no new services, directly supports the Foundation milestone. The one-box-many-stacks model is the common case for resource-constrained dev environments.

@@ -1,77 +1,177 @@
# Sprint: vault blast-radius tiers
# Sprint: Vault blast-radius tiers
## Vision issues
- #419 — Vault: blast-radius based approval tiers

## What this enables

After this sprint, low-tier vault actions execute without waiting for a human. The dispatcher auto-approves and merges vault PRs classified as `low` in `policy.toml`. Medium and high tiers are unchanged: medium notifies and allows async review; high blocks until admin approves.

After this sprint, vault operations are classified by blast radius — low-risk operations (docs, feature-branch edits) flow through without human gating; medium-risk operations (CI config, Dockerfile changes) queue for async review; high-risk operations (production deploys, secrets rotation, agent self-modification) hard-block as today.

This removes the bottleneck on low-risk bookkeeping operations while preserving the hard gate on production deploys, secret operations, and agent self-modification.

The practical effect: the dev loop no longer stalls waiting for human approval of routine operations. Agents can move autonomously through 80%+ of vault requests while preserving the safety contract on irreversible operations.

## What exists today

The vault redesign (#73-#77) is complete and all five issues are closed:
- lib/vault.sh - idempotent vault PR creation via Forgejo API
- docker/edge/dispatcher.sh - polls merged vault PRs, verifies admin approval, launches runners
- vault/vault-env.sh - TOML validation for vault action files
- vault/SCHEMA.md - vault action TOML schema
- lib/branch-protection.sh - admin-only merge enforcement on ops repo

The tier infrastructure is fully built. Only the enforcement is missing.

- `vault/policy.toml` — Maps every formula to low/medium/high. Current low tier: groom-backlog, triage, reproduce, review-pr. Medium: dev, run-planner, run-gardener, run-predictor, run-supervisor, run-architect, upgrade-dependency. High: run-publish-site, run-rent-a-human, add-rpc-method, release.
- `vault/classify.sh` — Shell classifier called by `vault-env.sh`. Returns tier for a given formula.
- `vault/SCHEMA.md` — Documents `blast_radius` override field (string: "low"/"medium"/"high") that vault action TOMLs can use to override policy defaults.
- `vault/validate.sh` — Validates vault action TOML fields including blast_radius.
- `docker/edge/dispatcher.sh` — Edge dispatcher. Polls ops repo for merged vault PRs and executes them. Currently fires ALL merged vault PRs without tier differentiation.

What's missing: the dispatcher does not read blast_radius, does not auto-approve low-tier PRs, and does not differentiate notification behavior for medium vs high tier.

Currently every vault request goes through the same hard-block path regardless of risk. No classification layer exists. All formulas share the same single approval tier.
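Reconstructed from the tier lists above, the flat policy file would look roughly like this (a sketch; exact formatting and ordering are assumptions):

```toml
# vault/policy.toml — formula to blast-radius tier
groom-backlog = "low"
triage        = "low"
reproduce     = "low"
review-pr     = "low"

dev           = "medium"
run-planner   = "medium"
run-gardener  = "medium"

run-publish-site = "high"
release          = "high"
# Formulas not listed here default to "high" in classify.sh (fail safe).
```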
## Complexity

Files touched: 3
- `docker/edge/dispatcher.sh` — read blast_radius from vault action TOML; for low tier, call Forgejo API to approve + merge the PR directly (admin token); for medium, post "pending async review" comment; for high, leave pending (existing behavior)
- `lib/vault.sh` `vault_request()` — include blast_radius in the PR body so the dispatcher can read it without re-parsing the TOML
- `docs/VAULT.md` — document the three-tier behavior for operators

Sub-issues: 3
Gluecode ratio: ~70% gluecode (dispatcher reads existing classify.sh output), ~30% new (auto-approve API call, comment logic)
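The dispatcher change reduces to one tier branch. A sketch with hypothetical helper names — `approve_and_merge_pr` and `post_pr_comment` stand in for the Forgejo API calls, which are indicated here, not implemented:

```shell
# Sketch of the tier branch for dispatcher.sh.
dispatch_by_tier() {
  local tier="$1" pr="$2"
  case "$tier" in
    low)    approve_and_merge_pr "$pr" ;;                  # Forgejo API, admin token
    medium) post_pr_comment "$pr" "pending async review" ;;
    high)   ;;                                             # leave pending (today's behavior)
    *)      ;;                                             # unknown tier: treat as high, fail safe
  esac
}
```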
Files touched: ~14 (7 new, 7 modified)
Gluecode vs greenfield: ~60% gluecode, ~40% greenfield.
Estimated sub-issues: 4-7 depending on fork choices.

## Risks

1. Classification errors on consequential operations. Default-deny mitigates: unknown formula → high.
2. Dispatcher complexity. Mitigation: extract to classify.sh; the dispatcher delegates.
3. Branch-protection interaction (primary design fork, see below).

- Admin token for auto-approve: the dispatcher needs an admin-level Forgejo token to approve and merge PRs. Currently `FORGE_TOKEN` is used; branch protection has `admin_enforced: true`, which means even admin bots are blocked from bypassing the approval gate. This is the core design fork: either (a) relax admin_enforced for low-tier PRs, or (b) use a separate Forgejo "auto-approver" account with admin rights, or (c) bypass the PR workflow entirely for low-tier actions (execute directly without a PR).
- Policy drift: as new formulas are added, policy.toml must be updated. If a formula is missing, classify.sh should default to "high" (fail safe). Currently the default behavior is unknown — this needs to be hardened.
- Audit trail: low-tier auto-approvals should still leave a record. An auto-approve comment ("auto-approved: low blast radius") satisfies this.

## Cost — new infra to maintain

- One new Forgejo account or token (if the auto-approver route is chosen) — needs a rotation policy
- `policy.toml` maintenance: every new formula must be classified before shipping
- No new services, cron jobs, or containers

## Cost - new infra to maintain

- vault/policy.toml or blast_radius fields — operators update when adding formulas.
- vault/classify.sh — one shell script, shellcheck-covered, no runtime daemon.
- No new services, cron jobs, or agent roles.

## Recommendation

Worth it. The vault redesign is done; blast-radius tiers are the natural next step. The primary reason agents cannot operate continuously is that every vault action blocks on human availability.

Worth it, but the design fork on the auto-approve mechanism must be resolved before implementation begins — this is the questions step.

---

The cleanest approach is option (c): bypass the PR workflow for low-tier actions entirely. The dispatcher detects blast_radius=low, executes the formula immediately without creating a PR, and writes to `vault/fired/` directly. This avoids the admin token problem, preserves the PR workflow for medium/high, and keeps the audit trail in git. However, it changes the blast_radius=low behavior from "PR exists but auto-merges" to "no PR, just executes" — operators need to understand the difference.
## Design forks

The PR route (option b) is more visible but requires a dedicated account.

Three decisions must be made before implementation begins.

### Fork 1 (Critical): Auto-approve merge mechanism

Branch protection on the ops repo requires `required_approvals: 1` and `admin_enforced: true`. For low-tier vault PRs, the dispatcher must merge without a human approval.

**A. Skip PR entirely for low-tier**
vault-bot commits directly to `vault/actions/` on main using admin token. No PR created. Dispatcher detects new TOML file by absence of `.result.json`.
- Simplest dispatcher code
- No PR audit trail for low-tier executions
- `FORGE_ADMIN_TOKEN` already exists in vault env (used by `is_user_admin()`)

**B. Dispatcher self-approves low-tier PRs**
vault-bot creates PR as today, then immediately posts an APPROVED review using its own token, then merges. vault-bot needs Forgejo admin role so `admin_enforced: true` does not block it.
- Full PR audit trail for all tiers
- Requires granting vault-bot admin role on Forgejo

**C. Tier-aware branch protection**
Create a separate Forgejo protection rule for `vault/*` branch pattern with `required_approvals: 0`. Main branch protection stays unchanged. vault-bot merges low-tier PRs directly.
- No new accounts or elevated role for vault-bot
- Protection rules are in Forgejo admin UI, not code (harder to version)
- Forgejo `vault/*` glob support needs verification

**D. Dedicated auto-approve bot**
Create a `vault-auto-bot` Forgejo account with admin role that auto-approves low-tier PRs. Cleanest trust separation; most operational overhead.

---

### Fork 2 (Secondary): Policy storage format

Where does the formula → tier mapping live?

**A. `vault/policy.toml` in disinto repo**
Flat TOML: `formula = "tier"`. classify.sh reads it at runtime. Unknown formulas default to `high`. Changing policy requires a disinto PR.

**B. `blast_radius` field in each `formulas/*.toml`**
Add `blast_radius = "low"|"medium"|"high"` to each formula TOML. classify.sh reads the target formula TOML for its tier. Co-located with the formula — impossible to add a formula without declaring its risk.

**C. `vault/policy.toml` in ops repo**
Same format as A but lives in the ops repo. Operators update without a disinto PR. Useful for per-deployment overrides.

**D. Hybrid: formula TOML default + ops override**
Formula TOML carries a default tier. Ops `vault/policy.toml` can override per-deployment. Most flexible; classify.sh must merge two sources.

---

### Fork 3 (Secondary): Medium-tier dev-loop behavior

When dev-agent creates a vault PR for a medium-tier action, what does it do while waiting?

**A. Non-blocking: fire and continue immediately**
Agent creates vault PR and moves to next issue without waiting. Maximum autonomy; sequencing becomes unpredictable.

**B. Soft-block with 2-hour timeout**
Agent waits up to 2 hours polling for vault PR merge. If no response, continues. Balances oversight with velocity.

**C. Status-quo block (medium = high)**
Medium-tier blocks the agent loop like high-tier today. Only low-tier actions unblocked. Simplest behavior change — no modification to dev-agent flow needed.

**D. Label-based approval signal**
Agent polls for a `vault-approved` label on the vault PR instead of waiting for merge. Decouples "approved to continue" from "PR merged and executed."
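Fork 3B's wait loop is small enough to sketch. `vault_pr_merged` is a hypothetical helper that would query the Forgejo PR API; the 15-min/2-hour numbers come from the fork description:

```shell
# Sketch of fork 3B: soft-block until the vault PR merges, polling every
# 15 min with a 2-hour ceiling. Returns 0 on merge, 1 on timeout (the
# agent continues either way; callers decide what timeout means).
soft_block_wait() {
  local pr="$1" interval="${2:-900}" deadline
  deadline=$(( $(date +%s) + ${3:-7200} ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    vault_pr_merged "$pr" && return 0
    sleep "$interval"
  done
  return 1
}
```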
---

## Proposed sub-issues

### Core (always filed regardless of fork choices)

**Sub-issue 1: vault/classify.sh — classification engine**
Implement `vault/classify.sh`: reads formula name, secrets, optional `blast_radius` override, applies policy rules, outputs tier (`low|medium|high`). Default-deny: unknown → `high`.
Files: `vault/classify.sh` (new), `vault/vault-env.sh` (call classify)
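Sub-issue 1's default-deny core could look like the following sketch, assuming fork 2A's flat `formula = "tier"` policy format; the real script would also honor the `blast_radius` override:

```shell
# Sketch of the classify.sh core: formula name in, tier out; unknown or
# malformed entries fail safe to "high".
classify_formula() {
  local formula="$1" policy="${2:-vault/policy.toml}" tier
  tier=$(awk -F'"' -v f="$formula" \
    '{ if ($1 ~ ("^" f "[ \t]*=[ \t]*$")) print $2 }' "$policy" 2>/dev/null)
  case "$tier" in
    low|medium|high) printf '%s\n' "$tier" ;;
    *)               printf 'high\n' ;;        # default-deny
  esac
}
```

Because anything that is not exactly one of the three tiers collapses to `high`, a missing policy file, a missing entry, or a duplicate entry all fail closed.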
**Sub-issue 2: docs/BLAST-RADIUS.md and SCHEMA.md update**
Write `docs/BLAST-RADIUS.md`. Add optional `blast_radius` field to `vault/SCHEMA.md` and validator.
Files: `docs/BLAST-RADIUS.md` (new), `vault/SCHEMA.md`, `vault/vault-env.sh`

**Sub-issue 3: Update prerequisites.md**
Mark vault redesign (#73-#77) as DONE (stale). Add blast-radius tiers to the tree.
Files: `disinto-ops/prerequisites.md`

### Fork 1 variants (pick one)

**1A** — Modify `lib/vault.sh` to skip PR for low-tier, commit directly to main. Modify `dispatcher.sh` to skip `verify_admin_merged()` for low-tier TOMLs.

**1B** — Modify `dispatcher.sh` to post APPROVED review + merge for low-tier. Grant vault-bot admin role in Forgejo setup scripts.

**1C** — Add `setup_vault_branch_protection_tiered()` to `lib/branch-protection.sh` with `required_approvals: 0` for `vault/*` pattern (verify Forgejo glob support first).

**1D** — Add `vault-auto-bot` account to `forge-setup.sh`. Implement approval watcher.

### Fork 2 variants (pick one)

**2A** — Create `vault/policy.toml` in disinto repo. classify.sh reads it.

**2B** — Add `blast_radius` field to all 15 `formulas/*.toml`. classify.sh reads formula TOML.

**2C** — Create `disinto-ops/vault/policy.toml`. classify.sh reads ops copy at runtime.

**2D** — Two-pass classify.sh: formula TOML default, ops policy override.

### Fork 3 variants (pick one)

**3A** — Non-blocking: `lib/vault.sh` returns immediately after PR creation for all tiers.

**3B** — Soft-block: poll medium-tier PR every 15 min for up to 2 hours.

**3C** — No change: medium-tier behavior unchanged (only low-tier unblocked).

**3D** — Create `vault-approved` label. Modify `lib/vault.sh` medium path to poll label.

@@ -1,233 +0,0 @@
# Sprint: versioned agent images

## Vision issues
- #429 — feat: publish versioned agent images — compose should use image: not build:

## What this enables

After this sprint, `disinto init` produces a `docker-compose.yml` that pulls a pinned image from a registry instead of building from source. A new factory instance needs only a token and a config file — no clone, no build, no local Docker context. This closes the gap between "works on my machine" and "one-command bootstrap."

It also enables rollback: if agents misbehave after an upgrade, `AGENTS_IMAGE=v0.1.1 disinto up` restores the previous version without touching the codebase.

## What exists today

The release pipeline is more complete than it looks:

- `formulas/release.toml` — 7-step release formula. Steps 4-5 already build and tag the image locally (`docker compose build --no-cache agents`, `docker tag disinto-agents disinto-agents:$RELEASE_VERSION`). The gap: no push step, no registry target.
- `lib/release.sh` — Creates vault TOML and ops repo PR for the release. No image version wired into compose generation.
- `lib/generators.sh` `_generate_compose_impl()` — Generates compose with `build: context: . dockerfile: docker/agents/Dockerfile` for agents, runner, reproduce, edge. Version-unaware.
- `vault/vault-env.sh` — `DOCKER_HUB_TOKEN` is in `VAULT_ALLOWED_SECRETS`. Not currently used.
- `docker/agents/Dockerfile` — No VOLUME declarations; runtime state, repos, and config are mounted via compose but not declared. Claude binary injected by compose at init time.

## Complexity

Files touched: 4
- `formulas/release.toml` — add `push-image` step (after tag-image, before restart-agents)
- `lib/generators.sh` — `_generate_compose_impl()` reads `AGENTS_IMAGE` env var; emits `image:` when set, falls back to `build:` when not set (dev mode)
- `docker/agents/Dockerfile` — add explicit VOLUME declarations for /home/agent/data, /home/agent/repos, /home/agent/disinto/projects, /home/agent/disinto/state
- `bin/disinto` `disinto_up()` — pass `AGENTS_IMAGE` through to compose if set in `.env`

Subsystems: release formula, compose generation, Dockerfile hygiene
Sub-issues: 3
Gluecode ratio: ~80% gluecode (release step, VOLUME declarations), ~20% new (AGENTS_IMAGE env var path)
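The generator branch is small. A sketch of just the image-vs-build decision — a fragment, not the real `_generate_compose_impl`, which emits the full service block; the `disinto-agents` image name comes from the existing tag step, and the 4-space indentation is an assumption about the generated YAML:

```shell
# Sketch of the image-vs-build branch inside compose generation.
emit_agents_image_or_build() {
  if [ -n "${AGENTS_IMAGE:-}" ]; then
    printf '    image: disinto-agents:%s\n' "$AGENTS_IMAGE"
  else
    # Dev mode: unchanged build-from-source behavior.
    printf '    build:\n'
    printf '      context: .\n'
    printf '      dockerfile: docker/agents/Dockerfile\n'
  fi
}
```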
## Risks

- Registry credentials: `DOCKER_HUB_TOKEN` is in the vault allowlist but not wired up. The push step needs a registry login — either Docker Hub (DOCKER_HUB_TOKEN) or GHCR (GITHUB_TOKEN, already in vault). The sprint spec must pick one and add the credential to the release vault TOML.
- Volume shadow: if VOLUME declarations don't match the compose volume mounts exactly, runtime files land in anonymous volumes instead of named ones. Must test before shipping.
- Existing deployments: currently on `build:`. Migration: set AGENTS_IMAGE in .env, re-run `disinto init` (compose is regenerated), restart. No SSH, no worktree needed.
- `runner` service: same image as agents, same version. Must update the runner service in compose gen too.

## Cost — new infra to maintain

- Registry account + token rotation: one vault secret (DOCKER_HUB_TOKEN) needs a rotation policy. GHCR (via GITHUB_TOKEN) needs no additional account but ties the release to GitHub.
- Release formula grows from 7 to 8 steps. Small maintenance surface.
- `AGENTS_IMAGE` becomes a documented env var in .env for pinned deployments. Needs docs.

## Recommendation

Worth it. The release formula is 90% done — one push step closes the gap. The compose generation change is purely additive (AGENTS_IMAGE env var, fallback to build: for dev). Volume declarations are hygiene that should exist regardless of versioning.

Pick GHCR over Docker Hub: GITHUB_TOKEN is already in the vault allowlist and ops repo. No new account needed.

## Side effects of this sprint

Beyond versioned images, this sprint indirectly closes one open bug:

- **#665 (edge cold-start race)** — `disinto-edge` currently exits with code 128 on a cold `disinto up` because its entrypoint clones from `forgejo:3000` before forgejo's HTTP listener is up. Once edge's image embeds the disinto source at build time (no runtime clone), the race vanishes. The `depends_on: { forgejo: { condition: service_healthy } }` workaround proposed in #665 becomes unnecessary.

Worth flagging explicitly so a dev bot working on #665 doesn't apply that workaround in parallel — it would be churn this sprint deletes anyway.
|
||||
|
||||
## What this sprint does not yet enable

This sprint delivers versioned images and pinned compose. It is a foundation, not the whole client-box upgrade story. Four follow-up sprints complete the picture for harb-style client boxes — each independently scopable, with the dependency chain noted.
### Follow-up A: `disinto upgrade <version>` subcommand

**Why**: even with versioned images, an operator on a client box still has to coordinate multiple steps to upgrade — `git fetch && git checkout`, edit `.env` to set `AGENTS_IMAGE`, re-run `_generate_compose_impl`, `docker compose pull`, `docker compose up -d --force-recreate`, plus any out-of-band migrations. There is no single atomic command. Without one, "upgrade harb to v0.3.0" stays a multi-step human operation that drifts out of sync.

**Shape**:

```
disinto upgrade v0.3.0
```
Sequence (roughly):

1. `git fetch --tags` and verify the tag exists
2. Bail if the working tree is dirty
3. `git checkout v0.3.0`
4. `_env_set_idempotent AGENTS_IMAGE v0.3.0 .env` (helper from #641)
5. Re-run `_generate_compose_impl` (picks up the new image tag)
6. Run pre-upgrade migration hooks (Follow-up C)
7. `docker compose pull && docker compose up -d --force-recreate`
8. Run post-upgrade migration hooks
9. Health check; roll back to the previous version on failure
10. Log result
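The sequence above can be sketched as a dry-run shell function. This is hypothetical, not the real `disinto_upgrade()`: `run_migrations`, `health_check`, `rollback_to_previous`, and `log_result` are assumed names (only `_env_set_idempotent` comes from #641), and each step merely prints the command it would run.

```shell
# Dry-run sketch only: prints each step of the upgrade sequence instead of
# executing it. Helper names other than _env_set_idempotent are assumptions.
disinto_upgrade_dry_run() {
  local version="$1"
  echo "git fetch --tags && git rev-parse ${version}"                  # 1. tag exists?
  echo "git diff --quiet || abort 'dirty working tree'"                # 2. bail if dirty
  echo "git checkout ${version}"                                       # 3
  echo "_env_set_idempotent AGENTS_IMAGE ${version} .env"              # 4
  echo "_generate_compose_impl"                                        # 5
  echo "run_migrations pre ${version}"                                 # 6
  echo "docker compose pull && docker compose up -d --force-recreate"  # 7
  echo "run_migrations post ${version}"                                # 8
  echo "health_check || rollback_to_previous"                          # 9
  echo "log_result ${version}"                                         # 10
}

disinto_upgrade_dry_run v0.3.0
```

Printing the plan first also suggests an easy `--dry-run` flag for the real subcommand.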
**Files touched**: `bin/disinto` (~150 lines, new `disinto_upgrade()` function), possibly extracted to a new `lib/upgrade.sh` if it grows large enough to warrant separation.

**Dependency**: this sprint (needs `AGENTS_IMAGE` to be a real thing in `.env` and in the compose generator).
### Follow-up B: unify `DISINTO_VERSION` and `AGENTS_IMAGE`

**Why**: today there are two version concepts in the codebase:

- `DISINTO_VERSION` — used at `docker/edge/entrypoint-edge.sh:84` for the in-container source clone (`git clone --branch ${DISINTO_VERSION:-main}`). Defaults to `main`. Also set in the compose generator at `lib/generators.sh:397` for the edge service.
- `AGENTS_IMAGE` — proposed by this sprint for the docker image tag in compose.

These should be **the same value**. If you are running the `v0.3.0` agents image, the in-container source (if any clone still happens) should also be at `v0.3.0`. Otherwise you get a v0.3.0 binary running against v-something-else source, which is exactly the silent drift versioning is meant to prevent.

After this sprint folds source into the image, `DISINTO_VERSION` in containers becomes vestigial. The follow-up: pick one name (probably keep `DISINTO_VERSION`, since it is referenced in more places), have `_generate_compose_impl` set both `image:` and the env var from the same source, and delete the redundant runtime clone block in `entrypoint-edge.sh`.

**Files touched**: `lib/generators.sh`, `docker/edge/entrypoint-edge.sh` (delete the runtime clone block once the image carries source), possibly `lib/env.sh` for the default value.

**Dependency**: this sprint.
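A minimal sketch of what the unified wiring could look like inside `_generate_compose_impl`: one version value drives both the image tag and the container env var. The function name `emit_agents_service` and the registry path `ghcr.io/disinto/agents` are assumptions for illustration, not existing code.

```shell
# Hypothetical: derive one version value and use it for BOTH the image tag
# and DISINTO_VERSION, so the image and the in-container source cannot drift.
emit_agents_service() {
  local version="${AGENTS_IMAGE:-${DISINTO_VERSION:-main}}"
  cat <<EOF
  agents:
    image: ghcr.io/disinto/agents:${version}
    environment:
      DISINTO_VERSION: ${version}
EOF
}

AGENTS_IMAGE=v0.3.0 emit_agents_service
```

With this shape, setting `AGENTS_IMAGE=v0.3.0` in `.env` pins both values at once; leaving it unset falls back to `DISINTO_VERSION`, then `main`, for dev boxes.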
### Follow-up C: migration framework for breaking changes

**Why**: some upgrades have side effects beyond "new code in the container":

- The CLAUDE_CONFIG_DIR migration (#641 → `setup_claude_config_dir` in `lib/claude-config.sh`) needs a one-time `mkdir + mv + symlink` per host.
- The credential-helper cleanup (#669; #671 for the safety-net repair) needs in-volume URL repair.
- Future: schema changes in the vault, ops repo restructures, env var renames.

There is no `disinto/migrations/v0.3.0.sh` style framework. Existing migrations live ad hoc inside `disinto init` and run unconditionally on init. That works for fresh installs but not for "I'm upgrading from v0.2.0 to v0.3.0 and need migrations v0.2.1 → v0.2.2 → v0.3.0 to run in order".

**Shape**: a `migrations/` directory with one file per version (`v0.3.0.sh`, `v0.3.1.sh`, …). `disinto upgrade` (Follow-up A) invokes each migration file in order between the previous applied version and the target. Track the applied version in `.env` (e.g. `DISINTO_LAST_MIGRATION=v0.3.0`) or in `state/`. Standard rails/django/flyway pattern. The framework itself is small; the value is in having a place for migrations to live so they are not scattered through `disinto init` and lost in code review.
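The core of that runner fits in a few lines. A minimal sketch under the assumptions above (one `vX.Y.Z.sh` per version, `sort -V` available for version ordering); `run_migrations_between` is a hypothetical name, not existing code:

```shell
# True when version $1 <= version $2 (requires sort -V, i.e. GNU coreutils).
version_le() { [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]; }

# Run every migrations/v*.sh strictly after $last, up to and including $target,
# in version order. Sketch only: no tracking-key update, no error handling.
run_migrations_between() {
  local last="$1" target="$2" dir="${3:-migrations}"
  local f v
  for f in $(ls "$dir"/v*.sh 2>/dev/null | sort -V); do
    v=$(basename "$f" .sh)
    version_le "$v" "$last" && continue    # already applied
    version_le "$v" "$target" || continue  # beyond the target
    sh "$f"
  done
}
```

The real version would also write `DISINTO_LAST_MIGRATION` after each successful file, so a failed migration can resume rather than re-run from scratch.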
**Files touched**: `lib/upgrade.sh`, a new `migrations/` directory, and a tracking key in `.env` for the last applied migration version.

**Dependency**: Follow-up A (the upgrade command is the natural caller).
### Follow-up D: bootstrap-from-broken-state runbook

**Why**: this sprint and Follow-ups A–C describe the steady-state upgrade flow. But existing client boxes — harb-dev-box specifically — are not in steady state. harb's working tree is at tag `v0.2.0` (months behind main). Its containers are running locally built `:latest` images of unknown vintage. Some host-level state (`CLAUDE_CONFIG_DIR`, the `~/.git/config` credential helper from the disinto-dev-box rollout) has not been applied on harb yet. The clean upgrade flow cannot reach harb from where it currently is — there is too much drift.

Each existing client box needs a **one-time manual reset** to a known-good baseline before the versioned upgrade flow takes over. The reset is mechanical but not automatable — it touches host-level state that pre-dates the new flow.

**Shape**: a documented runbook at `docs/client-box-bootstrap.md` (or similar) that walks operators through the one-time reset:

1. `disinto down`
2. `git fetch --all && git checkout <latest tag>` on the working tree
3. Apply host-level migrations:
   - `setup_claude_config_dir true` (from `lib/claude-config.sh`, added in #641)
   - Strip embedded creds from `.git/config`'s forgejo remote and add the inline credential helper using the pattern from #669
   - Rotate `FORGE_PASS` and `FORGE_TOKEN` if they have leaked (separate decision)
4. Rebuild images (`docker compose build`) or pull from the registry once this sprint lands
5. `disinto up`
6. Verify with `disinto status` and a smoke fetch through the credential helper

After the reset, the box is in a known-good baseline and `disinto upgrade <version>` takes over for all subsequent upgrades. The runbook documents this as the only manual operation an operator should ever have to perform on a client box.

**Files touched**: new `docs/client-box-bootstrap.md`. Optionally a small change to `disinto init` to detect "this looks like a stale-state box that needs the reset runbook, not a fresh init" and refuse with a pointer to the runbook.

**Dependency**: none (can be done in parallel with this sprint and the others).
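The optional `disinto init` guard could look roughly like this. Everything here is hypothetical, sketched from the drift symptoms above: the function name and the specific signal (old tag checked out, no `AGENTS_IMAGE` in `.env`) are assumptions, not the real detection logic.

```shell
# Hypothetical guard for disinto init: a box pinned to an old tag with no
# AGENTS_IMAGE in .env pre-dates the versioned flow and needs the one-time
# reset from docs/client-box-bootstrap.md rather than a fresh init.
needs_bootstrap_runbook() {
  local checked_out_tag="$1" env_file="$2"
  if [ -n "$checked_out_tag" ] && ! grep -q '^AGENTS_IMAGE=' "$env_file" 2>/dev/null; then
    echo "stale-state box (at ${checked_out_tag}, no AGENTS_IMAGE): see docs/client-box-bootstrap.md" >&2
    return 1
  fi
  return 0
}
```

`disinto init` would call this early and refuse to proceed when it returns nonzero, pointing the operator at the runbook instead.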
## Updated recommendation

The original recommendation stands: this sprint is worth it, ~80% gluecode, GHCR over Docker Hub. Layered on top:

- **Sequence the four follow-ups**: D (bootstrap runbook) is independent of this sprint's image work and can land in parallel. A (upgrade subcommand) and B (version unification) depend on this sprint's `AGENTS_IMAGE` work. C (migration framework) can wait until the first migration that actually needs it — `setup_claude_config_dir` doesn't, since it already lives in `disinto init`.
- **Do not fix #665 in parallel**: as noted in "Side effects", this sprint deletes the cause. A `depends_on: service_healthy` workaround applied to edge in parallel would be wasted work.
- **Do not file separate forge issues for the follow-ups until this sprint is broken into sub-issues**: keep them in this document until the architect (or the operator) is ready to commit to a sequence. That avoids backlog clutter and lets the four follow-ups stay reorderable as the sprint shape evolves.
@@ -1,105 +0,0 @@
# Sprint: website observability wire-up

## Vision issues
- #426 — Website observability — make disinto.ai an observable addressable

## What this enables
After this sprint, the factory can read engagement data from disinto.ai. The planner will have daily evidence files in `evidence/engagement/` to answer: how many people visited, where they came from, which pages they viewed. Observables will exist. The prerequisites for two milestones unlock:
- Adoption: "Landing page communicating value proposition" (evidence confirms it works)
- Ship (Fold 2): "Engagement measurement baked into deploy pipelines" (the verify-observable step becomes non-advisory)
## What exists today

The design and most of the code are already done:

- `site/collect-engagement.sh` — Complete. Parses Caddy's JSON access log, computes unique visitors / page views / top referrers, writes dated JSON evidence to `$OPS_REPO_ROOT/evidence/engagement/YYYY-MM-DD.json`.
- `formulas/run-publish-site.toml` verify-observable step — Complete. Checks Caddy log activity, script presence, and evidence recency on every deploy.
- `docs/EVIDENCE-ARCHITECTURE.md` — Documents the full pipeline: Caddy logs → collect-engagement → evidence/engagement/
- `docs/OBSERVABLE-DEPLOY.md` — Documents the observable deploy pattern.
- `docker/edge/Dockerfile` — Caddy edge container exists for the factory.
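As a rough illustration of the kind of aggregation `collect-engagement.sh` performs (not its actual code): given Caddy's one-JSON-object-per-line access log, unique visitors and page views reduce to a sort/count. The field names `remote_ip` and `uri` are assumptions about the log shape, not confirmed from the script.

```shell
# Sketch only: count unique visitors and page views from a Caddy-style
# JSON access log (one object per line). Field names are assumed.
aggregate_engagement() {
  local log="$1"
  local visitors views
  visitors=$(grep -o '"remote_ip":"[^"]*"' "$log" | sort -u | wc -l | tr -d ' ')
  views=$(grep -c '"uri":' "$log")
  printf '{"unique_visitors":%s,"page_views":%s}\n' "$visitors" "$views"
}
```

The real script additionally extracts top referrers and writes the result to the dated evidence path mentioned above.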
What's missing is the wiring: connecting the factory to the remote Caddy host where disinto.ai runs.
## Complexity

Files touched: 4-6 depending on fork choices
Subsystems: vault dispatch, SSH access, log collection, ops repo evidence
Sub-issues: 3-4
Gluecode ratio: ~80% gluecode, ~20% greenfield (the container/formula is new)
## Risks

- Production Caddy is on a separate host from the factory — all collection must go over SSH.
- Log format mismatch: collect-engagement.sh assumes Caddy's structured JSON format. If the production Caddy uses the default Combined Log Format, the script will silently produce empty reports.
- SSH key scope: the key used for collection should be purpose-limited to avoid granting broad access.
- Evidence commit: the container must commit evidence to the ops repo via the Forgejo API (not git push over SSH) to keep the secret surface minimal.
## Cost — new infra to maintain

- One vault action formula (`formulas/collect-engagement.toml` or extension of an existing formula)
- One SSH key in the Caddy host's authorized_keys
- Daily evidence files in the ops repo (evidence/engagement/*.json) — ~1 KB/file
- No new long-running services or agents
## Recommendation

Worth it. The human-directed architecture (dispatchable container with SSH) is cleaner than running cron directly on the production host — it keeps all factory logic inside the factory and treats the Caddy host as a dumb data source.
## Design forks

### Q1: What does the container fetch from the Caddy host?

*Context: `collect-engagement.sh` already parses Caddy JSON access logs into evidence JSON. The question is where that parsing happens.*

- **(A) Fetch raw log, process locally**: Container SSHs in, copies today's access log segment (e.g. `rsync` or `scp`), then runs `collect-engagement.sh` inside the container against the local copy. The Caddy host needs zero disinto code installed.
- **(B) Run script remotely**: Container SSHs in and executes `collect-engagement.sh` on the Caddy host. Requires the script (or a minimal version) to be deployed on the host. Output piped back.
- **(C) Pull Caddy metrics API**: Container opens an SSH tunnel to Caddy's admin API (port 2019) and pulls request metrics directly. No log file parsing — but Caddy's metrics endpoint is less rich than full access log analysis (no referrers, no per-page breakdown).

*Architect recommends (A): keeps the Caddy host dumb, all logic in the factory container, and `collect-engagement.sh` runs unchanged.*
### Q2: How is the daily collection triggered?

*Context: Other factory agents (supervisor, planner, gardener) run on direct cron via `*-run.sh`. Vault actions go through the PR approval workflow. The collection is a recurring low-risk read-only operation.*

- **(A) Direct cron in edge container**: Add a cron entry to the edge container entrypoint, like supervisor/planner. Simple, no vault overhead. Runs daily without approval.
- **(B) Vault action with auto-dispatch**: Create a recurring vault action TOML. If PR #12 (blast-radius tiers) lands, low-tier actions auto-execute. If not, each run needs admin approval — too heavy for daily collection.
- **(C) Supervisor-triggered**: Supervisor detects stale evidence (no `evidence/engagement/` file for today) and dispatches collection. Reactive rather than scheduled.

*Architect recommends (A): this is a read-only data collection, same risk profile as supervisor health checks. Vault gating a daily log fetch adds friction without security benefit.*
### Q3: How is the SSH key provisioned for the collection container?

*Context: The vault dispatcher supports `mounts: ["ssh"]`, which mounts `~/.ssh` read-only into the container. The edge container already has SSH infrastructure for reverse tunnels (`disinto-tunnel` user, `autossh`).*

- **(A) Factory operator's SSH keys** (`mounts: ["ssh"]`): Reuse the existing SSH keys on the factory host. Simple, but grants the container access to all hosts the operator can reach.
- **(B) Dedicated purpose-limited key**: Generate a new SSH keypair, install the public key on the Caddy host with a `command=` restriction (only allows `cat /var/log/caddy/access.log` or similar). Private key stored as `CADDY_SSH_KEY` in `.env.vault.enc`. Minimal blast radius.
- **(C) Edge tunnel reverse path**: Instead of the factory SSHing *out* to Caddy, have the Caddy host push logs *in* via the existing reverse tunnel infrastructure. Inverts the connection direction but requires a log-push agent on the Caddy host.

*Architect recommends (B): a purpose-limited key with a `command=` restriction on the Caddy host gives least-privilege access. The factory never gets a shell on production.*
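For concreteness, the `command=` restriction in (B) would look something like the following entry in the Caddy host's `~/.ssh/authorized_keys`; the log path, key type, and comment are placeholders:

```
command="cat /var/log/caddy/access.log",no-pty,no-port-forwarding,no-agent-forwarding,no-X11-forwarding ssh-ed25519 AAAA...collector-public-key... engagement-collector
```

With a forced command, sshd ignores whatever command the client requests and always runs the `cat`, so the collection container can fetch the log but never gets a shell or a tunnel on production.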
## Proposed sub-issues

### If Q1=A, Q2=A, Q3=B (recommended path):

1. **`collect-engagement` formula + container script**: Create `formulas/collect-engagement.toml` with steps: SSH into the Caddy host using the dedicated key → fetch today's access log segment → run `collect-engagement.sh` on the local copy → commit evidence JSON to the ops repo via the Forgejo API. Add a cron entry to the edge container.
2. **Format-detection guard in `collect-engagement.sh`**: Add a check at script start that verifies the input file is Caddy JSON format (not Combined Log Format). Fail loudly with an actionable error if the format is wrong.
3. **`evidence/engagement/` directory + ops-setup wiring**: Ensure `lib/ops-setup.sh` creates the evidence directory. Register the engagement cron schedule in factory setup docs.
4. **Document Caddy host SSH setup**: Rent-a-human instructions for: generate a keypair, install the public key with a `command=` restriction on the Caddy host, add the private key to `.env.vault.enc`.
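Sub-issue 2's guard can be small. A hypothetical shape, not the actual script: Caddy's JSON log starts every line with `{`, while Combined Log Format starts with the client address, so checking the first byte catches the mismatch before it silently yields empty reports.

```shell
# Hypothetical format guard: fail loudly when the input is not Caddy JSON.
assert_caddy_json_log() {
  local log="$1" first
  first=$(head -c 1 "$log")
  if [ "$first" != "{" ]; then
    echo "error: ${log} is not Caddy JSON (first byte '${first}');" \
         "expected one JSON object per line — check Caddy's log format config" >&2
    return 1
  fi
}
```

Called at the top of `collect-engagement.sh`, this turns the silent-empty-report failure mode into an immediate, actionable error.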
### If Q1=B (remote execution):
Sub-issues 2-4 remain the same. Sub-issue 1 changes: the container SSHs in and runs the script remotely, requiring script deployment on the Caddy host (an additional manual step).

### If Q2=B (vault-gated):
Sub-issue 1 changes: instead of cron, create a vault action TOML template and document the daily dispatch. Depends on PR #12 (blast-radius tiers) for auto-approval.

### If Q3=A (operator SSH keys):
Sub-issue 4 is simplified (no dedicated key generation), but the blast radius is wider.

### If Q3=C (reverse tunnel):
Sub-issue 1 changes significantly: instead of SSH-out, configure a log-push cron on the Caddy host that sends logs through the reverse tunnel. More infrastructure on the Caddy host side.
23 vault/actions/fix-ops-branch-protection-20260415.toml Normal file
@@ -0,0 +1,23 @@
# Vault action: fix-ops-branch-protection-20260415
# Filed by: gardener (2026-04-15)
# Unblocks: #758, #765

context = "Ops repo (disinto-admin/disinto-ops) branch protection on main requires approvals but no bot account has sufficient permissions to merge PRs. planner-bot has push but cannot merge. review-bot can approve but cannot push/merge. ops/main frozen at v0.2.0 since 2026-04-08. Knowledge, vault items, and sprint artifacts accumulate locally and are lost on container restart."

unblocks = ["#758", "#765"]

[action_required]
description = """
Choose ONE of the following:

Option 1 (recommended): Add planner-bot to the merge allowlist in disinto-ops branch protection.
Forgejo admin UI: disinto-admin/disinto-ops > Settings > Branches > main > Edit
Under 'Whitelist Merge': add planner-bot

Option 2: Remove branch protection from disinto-ops main.
Agents are the primary writers; branch protection adds friction without safety benefit here.

Option 3: Create an admin-level FORGE_ADMIN_TOKEN and add it to agent secrets.
Create a Forgejo admin user or promote an existing bot, issue a token,
and add it to the agent container environment as FORGE_ADMIN_TOKEN.
"""
0 vault/approved/.gitkeep Normal file
0 vault/fired/.gitkeep Normal file
0 vault/pending/.gitkeep Normal file
31 vault/pending/disinto-ops-branch-protection.md Normal file
@@ -0,0 +1,31 @@
# Request: Remove or relax ops repo branch protection for agent writes

## What
The ops repo (`disinto-ops`) has branch protection on `main` that requires approvals, but no bot account has sufficient permissions to merge. The `planner-bot` has push access but cannot merge. The `review-bot` can approve but cannot push or merge. No admin token is available to agents.

This means `prerequisites.md`, `knowledge/planner-memory.md`, and vault items have been accumulating **only locally** since planner run 2 (2026-04-08). The remote `origin/main` is frozen.

## Why
Blocks #758 (ops repo branch protection), which blocks ALL agent ops-repo writes: planner prerequisite tree, planner memory, evidence collection, vault pending items. Every agent that writes to the ops repo is silently failing.

Downstream: blocks website observability (#426), collect-engagement (#745), and the entire evidence pipeline.

Waiting since 2026-04-08 (first observed in planner run 2).

## Human action
1. In Forgejo, go to `disinto-ops` → Settings → Branch Protection → `main`
2. Either:
   - **Option A (recommended):** Remove branch protection from `disinto-ops` entirely — the ops repo is an internal artifact, not production code. Agent writes should flow freely.
   - **Option B:** Add `planner-bot` and `dev-bot` to the push/merge allowlist so they can push directly to `main`.
3. Verify by running `cd disinto-ops && git push origin main` from the agents container.

## Factory will then
- Planner will push prerequisite tree updates and memory to `origin/main`
- Evidence collection (#745) will unblock — the collect-engagement formula can commit to the ops repo
- Vault pending items will be visible on the remote for human review
- All agents writing to the ops repo will resume normal operation

## Unblocks
- #758 — ops repo branch protection blocks all agent writes
- #745 — collect-engagement formula (indirectly, if the no_push is ops-related)
- #426 — website observability (downstream)
0 vault/rejected/.gitkeep Normal file