# Evidence Architecture — Roadmap

> **Status: Partially Implemented.** This document describes the target evidence architecture. Items marked **Implemented** exist in the codebase; items marked **Partial** have upstream scripts but no evidence output yet; all others are **Planned**. See AGENTS.md for the current operational state.

Disinto is purpose-built for one loop: **build software, launch it, improve it, reach market fit.**

This document describes how autonomous agents will sense the world, produce evidence, and use that evidence to make decisions, from "which issue to work on next" to "is this ready to deploy."

## The Loop

```
build → measure → evidence good enough?
  no  → improve → build again
  yes → deploy → measure in-market → evidence still good?
    no  → improve → build again
    yes → expand
```

Every decision in this loop will be driven by evidence, not intuition. The planner will read structured evidence across all dimensions, identify the weakest one, and focus there.

## Evidence as Integration Layer

Different domains have different platforms:

| Domain | Platform | What it tracks | Status |
|--------|----------|----------------|--------|
| Code | forge | Issues, PRs, reviews | **Implemented**: Live |
| CI/CD | Woodpecker | Build/test results | **Implemented**: Live |
| Protocol | Ponder / GraphQL | On-chain state, trades, positions | **Partial**: Live (not yet wired to evidence) |
| Infrastructure | DigitalOcean / system stats | CPU, RAM, disk, containers | **Planned**: Supervisor monitors, no evidence output yet |
| User experience | Playwright personas | Conversion, friction, journey completion | **Partial**: Scripts exist (`run-usertest.sh`), no evidence output yet |
| Engagement | Caddy access logs | Visitors, referral sources, page paths | **Implemented**: `site/collect-engagement.sh` |
| Funnel | Analytics (future) | Bounce rate, conversion, retention | **Planned**: Not started |

Agents won't need to understand each platform. **Processes act as adapters**: they will read a platform's API and write structured evidence to git.

```
[Caddy logs]       ──→ collect-engagement process ──→ {project}-ops/evidence/engagement/YYYY-MM-DD.json
[Google Analytics] ──→ measure-funnel process     ──→ {project}-ops/evidence/funnel/YYYY-MM-DD.json
[Ponder GraphQL]   ──→ measure-protocol process   ──→ {project}-ops/evidence/protocol/YYYY-MM-DD.json
[System stats]     ──→ measure-resources process  ──→ {project}-ops/evidence/resources/YYYY-MM-DD.json
[Playwright]       ──→ run-user-test process      ──→ {project}-ops/evidence/user-test/YYYY-MM-DD.json
```

The planner will read `$OPS_REPO_ROOT/evidence/`, not Analytics, not Ponder, not DigitalOcean. Evidence is the normalized interface between the world and decisions.

> **Terminology note ("process" vs "formula"):** In this document, "process" means a self-contained measurement or mutation pipeline that reads an external platform and writes structured evidence to git. This is distinct from disinto's "formulas" (`formulas/*.toml`), which are TOML issue templates that guide agents through multi-step operational work (see `AGENTS.md` § Directory layout). Processes produce evidence; formulas orchestrate agent tasks.

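
To make the adapter idea concrete, here is a minimal Python sketch of a process writing one dated evidence file. The function name `write_evidence`, the metric keys, and the JSON shape are assumptions for illustration only; this document does not define a schema, and the existing `site/collect-engagement.sh` may work differently. Committing the file to git is left out.

```python
import json
from datetime import date
from pathlib import Path


def write_evidence(ops_root, dimension, metrics, day=None):
    """Write one dated JSON evidence file and return its path.

    Hypothetical helper: the path follows the
    {project}-ops/evidence/<dimension>/YYYY-MM-DD.json pattern,
    but the JSON layout below is an assumption.
    """
    day = day or date.today()
    target_dir = Path(ops_root) / "evidence" / dimension
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{day.isoformat()}.json"
    payload = {
        "dimension": dimension,
        "date": day.isoformat(),
        "metrics": metrics,
    }
    # Sorted keys keep git diffs between runs small and stable.
    target.write_text(json.dumps(payload, indent=2, sort_keys=True) + "\n")
    return target
```

A collector could call `write_evidence(ops_root, "engagement", {"visitors": 120})`, with `ops_root` taken from `$OPS_REPO_ROOT`, after parsing its platform's data.
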
## Process Types

### Sense processes

Sense processes produce evidence without modifying the project under test. Some are pure reads (API calls, system stats); others, namely `run-holdout` and `run-user-test`, spawn a Docker stack (containers, volumes, networks) that requires the Docker daemon and leaves ephemeral state on the host until explicitly torn down. These are **not** safe to treat as no-ops: they consume resources and mutate host-level Docker state.

| Process | Measures | Platform | Resource profile | Status |
|---------|----------|----------|------------------|--------|
| `run-holdout` | Code quality against blind scenarios | Playwright + docker stack | Spawns Docker stack (containers + volumes + networks); requires Docker daemon; leaves ephemeral state until torn down | **Implemented**: `evaluate.sh` exists (harb #977) |
| `run-user-test` | UX quality across 5 personas | Playwright + docker stack | Spawns Docker stack (containers + volumes + networks); requires Docker daemon; leaves ephemeral state until torn down | **Implemented**: `run-usertest.sh` exists (harb #978) |
| `measure-resources` | Infra state (CPU, RAM, disk, containers) | System / DigitalOcean API | Read-only API calls. Safe to run anytime | **Planned** |
| `measure-protocol` | On-chain health (floor, reserves, volume) | Ponder GraphQL | Read-only API calls. Safe to run anytime | **Planned** |
| `collect-engagement` | Visitor engagement (visitors, referrers, pages) | Caddy access logs | Read-only log parsing. Safe to run anytime | **Implemented**: `site/collect-engagement.sh` (disinto #718) |
| `measure-funnel` | User conversion and retention | Analytics API | Read-only API calls. Safe to run anytime | **Planned** |

### Mutation processes (create change)

Mutation processes produce new artifacts and consume significant resources. Their results are delivered via PR.

| Process | Produces | Consumes | Status |
|---------|----------|----------|--------|
| `run-evolution` | Better optimizer candidates (`.push3` programs) | CPU-heavy: transpile + compile + deploy + attack per candidate | **Implemented**: `evolve.sh` exists (harb #975) |
| `run-red-team` | Evidence (floor held?) + new attack vectors | CPU + RAM for revm evaluation | **Implemented**: `red-team.sh` exists (harb #976) |

### Feedback loops

Mutation processes will feed each other:

```
red-team discovers attack → new vector added to attacks/ via PR
  → evolution scores candidates against harder attacks
  → better optimizers survive
  → red-team runs again against improved candidates
```

The planner won't need to know this loop exists as a rule. It will emerge from evidence: "new attack vectors landed since last evolution run → evolution scores are stale → run evolution."

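
The staleness inference quoted above can be sketched as a set comparison. This is only an illustration of the signal the evidence carries, not a proposed planner rule; the function name and the idea that an evolution evidence file records which attack vectors it was scored against are assumptions.

```python
def evolution_staleness(current_attacks, scored_attacks):
    """Return a human-readable reason if evolution evidence looks stale, else None.

    current_attacks: names of attack vectors present now (hypothetical input).
    scored_attacks:  names recorded by the last evolution run (hypothetical input).
    """
    current, scored = set(current_attacks), set(scored_attacks)
    unseen = sorted(current - scored)
    if not unseen:
        return None
    return (
        f"evolution scored against {len(scored)} attacks "
        f"but {len(current)} exist now; unseen: {', '.join(unseen)}"
    )
```
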
## Evidence Directory

> **Not yet created.** See harb #973 for the implementation issue.

```
evidence/
  engagement/   # Visitor counts, referrers, page paths (from Caddy logs)
  evolution/    # Run params, generation stats, best fitness, champion
  red-team/     # Per-attack results, floor held/broken, ETH extracted
  holdout/      # Per-scenario pass/fail, gate decision
  user-test/    # Per-persona reports, friction points
  resources/    # CPU, RAM, disk, container state
  protocol/     # On-chain metrics from Ponder
  funnel/       # Analytics conversion data (future)
```

Each file will be dated, machine-readable JSON. Git history will show trends. The planner will diff against previous runs to detect improvement or regression.

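
Comparing against the previous run could look roughly like the following Python sketch. It assumes each evidence file is a flat JSON object of numeric metrics, which this document does not specify; `latest_two` and `diff_metrics` are hypothetical names.

```python
import json
from pathlib import Path


def latest_two(dimension_dir):
    """Return (older, newer) paths of the two most recent dated files, or None."""
    # YYYY-MM-DD.json names sort chronologically when sorted as text.
    files = sorted(Path(dimension_dir).glob("*.json"))
    if len(files) < 2:
        return None
    return files[-2], files[-1]


def diff_metrics(older, newer):
    """Return {metric: newer - older} for numeric metrics present in both runs."""
    return {
        key: newer[key] - older[key]
        for key in older.keys() & newer.keys()
        if isinstance(older[key], (int, float))
        and isinstance(newer[key], (int, float))
    }
```

Whether a positive delta means improvement or regression depends on the metric, so the sign is left for the reader of the diff to interpret.
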
## Delivery Pattern

Every process will follow the same delivery contract:

1. **Evidence** (metrics/reports) → committed to `evidence/` on main
2. **Artifacts** (code changes, new attack vectors, evolved programs) → PR
3. **Summary** → issue comment with key metrics and link to evidence file

## Evidence-Gated Deployment

Deployment will not be a human decision or a calendar event. It will be the natural consequence of all evidence dimensions being green:

- **Holdout:** 90% of scenarios pass
- **Red-team:** Floor holds on all known attacks
- **User-test:** All personas complete journey, newcomers convert
- **Evolution:** Champion fitness above threshold
- **Protocol metrics:** ETH reserve growing, floor ratcheting up
- **Funnel:** Bounce rate below target, conversion above target

When all dimensions pass their thresholds, deployment becomes the obvious next action. Until then, the planner will know **which dimension is weakest** and focus resources there.

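
One way to picture the gate is as a comparison of per-dimension scores against thresholds. The sketch below assumes every dimension has been normalized to a single "higher is better" number on a shared scale, which is a simplification this document does not prescribe; the function name and the example values are hypothetical.

```python
def weakest_dimension(scores, thresholds):
    """Return (ready, weakest) for a non-empty thresholds mapping.

    scores, thresholds: dimension name -> number, higher is better.
    A dimension with no score counts as 0.0 (missing evidence is not green).
    """
    margins = {
        name: scores.get(name, 0.0) - limit
        for name, limit in thresholds.items()
    }
    # The weakest dimension is the one furthest below (or least above) its threshold.
    weakest = min(margins, key=margins.get)
    ready = all(margin >= 0 for margin in margins.values())
    return ready, weakest
```

For example, with thresholds `{"holdout": 0.9, "user-test": 0.8}` and scores `{"holdout": 0.95, "user-test": 0.6}`, the gate stays closed and `user-test` is reported as the weakest dimension.
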
## Resource Allocation

The planner will optimize resource allocation across all processes. When the box is idle, it will find the highest-value use of compute based on evidence staleness and current gaps.

Pure-read sense processes (API queries, system stats) are cheap: run them freely to keep evidence fresh. Docker-based sense processes (`run-holdout`, `run-user-test`) are heavier: they spin up full stacks and should be scheduled when the box has capacity.
Mutation processes are expensive: run them when evidence justifies the cost.

The planner will read evidence recency and decide:
- "Red-team results are from before the VWAP fix → re-run"
- "User-tests haven't run since February → stale"
- "Evolution scored against 4 attacks but we now have 6 → outdated"
- "Box is idle, no CI running → good time for evolution"

No schedules. No hardcoded rules. The planner's judgment, informed by evidence.

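
Evidence recency can be derived from the dated file names alone. The following Python sketch computes the age of the newest file per dimension; it produces an input for the planner's judgment, not a schedule or rule, and the function name is hypothetical.

```python
from datetime import date
from pathlib import Path


def evidence_age_days(evidence_root, today):
    """Map each dimension directory to the age in days of its newest dated file.

    Dimensions without any YYYY-MM-DD.json file map to None.
    """
    ages = {}
    for dim in sorted(p for p in Path(evidence_root).iterdir() if p.is_dir()):
        dates = []
        for f in dim.glob("*.json"):
            try:
                dates.append(date.fromisoformat(f.stem))
            except ValueError:
                continue  # skip files that are not named by date
        ages[dim.name] = (today - max(dates)).days if dates else None
    return ages
```

Calling `evidence_age_days(root, date.today())` on the evidence directory would return something like `{"funnel": None, "red-team": 5}`, which the planner could weigh alongside box load and known gaps.
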
## What Disinto Is Not

Disinto is not a general-purpose company operating system. It does not model arbitrary resources or business processes.

It is finely tuned for one thing: **money → software product → customer contact → knowledge → product improvement → market fit → more money.**

Every agent, process, and evidence type serves this loop.

## Related Issues

- harb #973: Evidence directory structure
- harb #974: Red-team attack vector auto-promotion
- harb #975: `run-evolution` process
- harb #976: `run-red-team` process
- harb #977: `run-holdout` process
- harb #978: `run-user-test` process
- disinto #139: Action agent (process executor)
- disinto #140: Prediction agent (evidence reader)
- disinto #142: Planner triages predictions