openhands eb0ee66c8f fix: Clarify sense-process resource profile — 'change nothing' is inaccurate for docker-based processes (#229 )

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-20 17:00:21 +00:00


Evidence Architecture — Roadmap

Status: Partially Implemented — This document describes the target evidence architecture. Items marked Implemented exist in the codebase; items marked Partial have upstream scripts but no evidence output yet; all others are Planned. See AGENTS.md for the current operational state.

Disinto is purpose-built for one loop: build software, launch it, improve it, reach market fit.

This document describes how autonomous agents will sense the world, produce evidence, and use that evidence to make decisions — from "which issue to work on next" to "is this ready to deploy."

The Loop

build → measure → evidence good enough?
  no  → improve → build again
  yes → deploy → measure in-market → evidence still good?
    no  → improve → build again
    yes → expand

Every decision in this loop will be driven by evidence, not intuition. The planner will read structured evidence across all dimensions, identify the weakest one, and focus there.

Evidence as Integration Layer

Different domains have different platforms:

| Domain | Platform | What it tracks | Status |
| --- | --- | --- | --- |
| Code | Codeberg | Issues, PRs, reviews | Implemented — Live |
| CI/CD | Woodpecker | Build/test results | Implemented — Live |
| Protocol | Ponder / GraphQL | On-chain state, trades, positions | Partial — Live (not yet wired to evidence) |
| Infrastructure | DigitalOcean / system stats | CPU, RAM, disk, containers | Planned — Supervisor monitors, no evidence output yet |
| User experience | Playwright personas | Conversion, friction, journey completion | Partial — Scripts exist (`run-usertest.sh`), no evidence output yet |
| Funnel | Analytics (future) | Bounce rate, conversion, retention | Planned — Not started |

Agents won't need to understand each platform. Processes act as adapters — they will read a platform's API and write structured evidence to git.

[Google Analytics] ──→ measure-funnel process ──→ evidence/funnel/YYYY-MM-DD.json
[Ponder GraphQL]  ──→ measure-protocol process ──→ evidence/protocol/YYYY-MM-DD.json
[System stats]    ──→ measure-resources process ──→ evidence/resources/YYYY-MM-DD.json
[Playwright]      ──→ run-user-test process ──→ evidence/user-test/YYYY-MM-DD.json

The planner will read evidence/ — not Analytics, not Ponder, not DigitalOcean. Evidence is the normalized interface between the world and decisions.
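A minimal sketch of the adapter contract described above, assuming each process writes one JSON file per day under `evidence/<dimension>/`. The function name `write_evidence` and the metric key in the usage note are hypothetical; the source only fixes the `evidence/<dimension>/YYYY-MM-DD.json` path pattern.

```python
import json
from datetime import date
from pathlib import Path


def write_evidence(root: Path, dimension: str, metrics: dict, day: date) -> Path:
    """Write one day's metrics to <root>/<dimension>/YYYY-MM-DD.json.

    `write_evidence` is a hypothetical helper, not an existing disinto API.
    """
    target_dir = root / dimension
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{day.isoformat()}.json"
    # sort_keys keeps key order stable, so git diffs show only real changes.
    target.write_text(json.dumps(metrics, indent=2, sort_keys=True) + "\n")
    return target
```

For example, a `measure-resources` run might call `write_evidence(Path("evidence"), "resources", {"cpu_percent": 12.5}, date.today())`, where `cpu_percent` is an illustrative metric name only. Committing the resulting file to git is left to the caller.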

Terminology note — "process" vs "formula": In this document, "process" means a self-contained measurement or mutation pipeline that reads an external platform and writes structured evidence to git. This is distinct from disinto's "formulas" (formulas/*.toml), which are TOML issue templates that guide agents through multi-step operational work (see AGENTS.md § Directory layout). Processes produce evidence; formulas orchestrate agent tasks.

Process Types

Sense processes

Produce evidence without modifying the project under test. Some sense processes are pure reads (API calls, system stats). Others, namely run-holdout and run-user-test, spawn a Docker stack (containers, volumes, networks) that requires the Docker daemon and leaves ephemeral state on the host until explicitly torn down. These are not safe to treat as no-ops: they consume resources and mutate host-level Docker state.

| Process | Measures | Platform | Resource profile | Status |
| --- | --- | --- | --- | --- |
| `run-holdout` | Code quality against blind scenarios | Playwright + docker stack | Spawns Docker stack (containers + volumes + networks); requires Docker daemon; leaves ephemeral state until torn down | Implemented — `evaluate.sh` exists (harb #977) |
| `run-user-test` | UX quality across 5 personas | Playwright + docker stack | Spawns Docker stack (containers + volumes + networks); requires Docker daemon; leaves ephemeral state until torn down | Implemented — `run-usertest.sh` exists (harb #978) |
| `measure-resources` | Infra state (CPU, RAM, disk, containers) | System / DigitalOcean API | Read-only API calls. Safe to run anytime | Planned |
| `measure-protocol` | On-chain health (floor, reserves, volume) | Ponder GraphQL | Read-only API calls. Safe to run anytime | Planned |
| `measure-funnel` | User conversion and retention | Analytics API | Read-only API calls. Safe to run anytime | Planned |
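Because the Docker-based sense processes leave state behind until torn down, a wrapper that guarantees teardown can help. The sketch below assumes the stack is defined by a Compose file in the working directory, which the source does not confirm; `docker_stack` and the project name are hypothetical. The `run` parameter exists only so the wrapper can be exercised without a Docker daemon.

```python
import subprocess
from contextlib import contextmanager
from typing import Callable, Iterator, Sequence

Runner = Callable[[Sequence[str]], None]


def _run(cmd: Sequence[str]) -> None:
    # check=True raises CalledProcessError if the command exits non-zero.
    subprocess.run(list(cmd), check=True)


@contextmanager
def docker_stack(project: str, run: Runner = _run) -> Iterator[str]:
    """Bring a Compose project up and always attempt teardown afterwards.

    Assumption: the stack is a Docker Compose project (hypothetical here).
    """
    try:
        run(["docker", "compose", "-p", project, "up", "-d"])
        yield project
    finally:
        # `down` removes containers and networks; --volumes also removes volumes.
        run(["docker", "compose", "-p", project, "down",
             "--volumes", "--remove-orphans"])
```

A holdout run would then wrap its scenario execution in `with docker_stack("holdout"):` so that a failing scenario still releases containers, volumes, and networks.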

Mutation processes (create change)

Produce new artifacts and consume significant resources. Results are delivered via PR.

| Process | Produces | Consumes | Status |
| --- | --- | --- | --- |
| `run-evolution` | Better optimizer candidates (`.push3` programs) | CPU-heavy: transpile + compile + deploy + attack per candidate | Implemented — `evolve.sh` exists (harb #975) |
| `run-red-team` | Evidence (floor held?) + new attack vectors | CPU + RAM for revm evaluation | Implemented — `red-team.sh` exists (harb #976) |

Feedback loops

Mutation processes will feed each other:

red-team discovers attack → new vector added to attacks/ via PR
  → evolution scores candidates against harder attacks
    → better optimizers survive
      → red-team runs again against improved candidates

The planner won't need to know this loop exists as a rule. It will emerge from evidence: "new attack vectors landed since last evolution run → evolution scores are stale → run evolution."
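One way the staleness signal could be derived from evidence is sketched below. It assumes the evolution evidence records the names of the attack vectors it scored against, which is a hypothetical field; the function names are illustrative too.

```python
from typing import Iterable, List


def unscored_attacks(current: Iterable[str], scored: Iterable[str]) -> List[str]:
    """Attack vectors that exist now but were not part of the last evolution run.

    `current` would come from listing attacks/; `scored` from a hypothetical
    field in the latest evolution evidence file.
    """
    return sorted(set(current) - set(scored))


def evolution_is_stale(current: Iterable[str], scored: Iterable[str]) -> bool:
    # Any attack vector the last run never saw makes its scores outdated.
    return len(unscored_attacks(current, scored)) > 0
```

The output is a fact about the evidence ("these vectors are unscored"), not a scheduling rule; deciding whether to run evolution stays with the planner.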

Evidence Directory

Not yet created. See harb #973 for the implementation issue.

evidence/
  evolution/        # Run params, generation stats, best fitness, champion
  red-team/         # Per-attack results, floor held/broken, ETH extracted
  holdout/          # Per-scenario pass/fail, gate decision
  user-test/        # Per-persona reports, friction points
  resources/        # CPU, RAM, disk, container state
  protocol/         # On-chain metrics from Ponder
  funnel/           # Analytics conversion data (future)

Each file will be dated, machine-readable JSON. Git history will show trends, and the planner will diff each run against previous runs to detect improvement or regression.
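Diffing against the previous run could look roughly like the sketch below. It assumes files are named `YYYY-MM-DD.json` and hold flat objects with numeric metrics; the function names and the metric keys in any usage are hypothetical, since the evidence schema is not yet defined.

```python
import json
from pathlib import Path
from typing import Dict, Optional, Tuple


def latest_two(dimension_dir: Path) -> Optional[Tuple[Path, Path]]:
    """Return (previous, latest) evidence files, or None if fewer than two exist."""
    # ISO dates (YYYY-MM-DD) sort chronologically when sorted as plain strings.
    files = sorted(dimension_dir.glob("*.json"), key=lambda p: p.name)
    if len(files) < 2:
        return None
    return files[-2], files[-1]


def _is_number(value: object) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def metric_deltas(previous: dict, latest: dict) -> Dict[str, float]:
    """Return latest - previous for every numeric metric present in both runs."""
    return {
        key: latest[key] - previous[key]
        for key in latest
        if key in previous and _is_number(latest[key]) and _is_number(previous[key])
    }


def diff_latest(dimension_dir: Path) -> Optional[Dict[str, float]]:
    """Load the two newest evidence files in a dimension and diff them."""
    pair = latest_two(dimension_dir)
    if pair is None:
        return None
    prev, last = (json.loads(p.read_text()) for p in pair)
    return metric_deltas(prev, last)
```

Whether a positive delta means improvement or regression depends on the metric, so interpreting the sign is left to the reader of the diff.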

Delivery Pattern

Every process will follow the same delivery contract:

Evidence (metrics/reports) → committed to evidence/ on main
Artifacts (code changes, new attack vectors, evolved programs) → PR
Summary → issue comment with key metrics and link to evidence file
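The summary part of this contract could be rendered as in the sketch below. The comment layout, the function name `summary_comment`, and the metric names in the usage note are assumptions; the source only requires key metrics plus a pointer to the evidence file.

```python
from typing import Mapping


def summary_comment(process: str, evidence_path: str,
                    metrics: Mapping[str, object]) -> str:
    """Render an issue-comment body: evidence location first, then key metrics.

    Hypothetical layout; the caller may turn evidence_path into a full URL.
    """
    lines = [f"{process}: evidence committed to {evidence_path}", ""]
    # Sorted keys give a stable ordering across runs.
    for name in sorted(metrics):
        lines.append(f"- {name}: {metrics[name]}")
    return "\n".join(lines)
```

For instance, `summary_comment("run-red-team", "evidence/red-team/2026-03-20.json", {"attacks_run": 6, "floor_held": True})` yields a short body suitable for posting on the originating issue.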

Evidence-Gated Deployment

Deployment will not be a human decision or a calendar event. It will be the natural consequence of all evidence dimensions being green:

Holdout: At least 90% of scenarios pass
Red-team: Floor holds on all known attacks
User-test: All personas complete journey, newcomers convert
Evolution: Champion fitness above threshold
Protocol metrics: ETH reserve growing, floor ratcheting up
Funnel: Bounce rate below target, conversion above target

When all dimensions pass their thresholds, deployment becomes the obvious next action. Until then, the planner will know which dimension is weakest and focus resources there.
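The gate can be sketched as a comparison of per-dimension scores against thresholds. This assumes every dimension is reduced to a score between 0 and 1 where higher is better, which the source does not specify; only the 90% holdout figure comes from the list above, and the `gate` function is hypothetical.

```python
from typing import Dict, Optional, Tuple


def gate(scores: Dict[str, float],
         thresholds: Dict[str, float]) -> Tuple[bool, Optional[str]]:
    """Return (ready_to_deploy, weakest_dimension).

    Assumes normalized 0..1 scores, higher is better (an illustrative choice).
    A dimension with no evidence counts as failing.
    """
    if not thresholds:
        raise ValueError("at least one dimension is required")
    margins = {}
    for dim, threshold in thresholds.items():
        score = scores.get(dim)
        # Missing evidence gets the worst possible margin.
        margins[dim] = float("-inf") if score is None else score - threshold
    ready = all(margin >= 0 for margin in margins.values())
    weakest = min(margins, key=lambda dim: margins[dim])
    return ready, (None if ready else weakest)
```

With `thresholds = {"holdout": 0.9, "red_team": 1.0}`, a holdout score of 0.95 and a red-team score of 0.5 would return `(False, "red_team")`, pointing the planner at the dimension furthest below its threshold.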

Resource Allocation

The planner will optimize resource allocation across all processes. When the box is idle, it will find the highest-value use of compute based on evidence staleness and current gaps.

Pure-read sense processes (API queries, system stats) are cheap — run them freely to keep evidence fresh. Docker-based sense processes (run-holdout, run-user-test) are heavier: they spin up full stacks and should be scheduled when the box has capacity. Mutation processes are expensive — run them when evidence justifies the cost.

The planner will read evidence recency and decide:

"Red-team results are from before the VWAP fix → re-run"
"User-tests haven't run since February → stale"
"Evolution scored against 4 attacks but we now have 6 → outdated"
"Box is idle, no CI running → good time for evolution"

No schedules. No hardcoded rules. The planner's judgment, informed by evidence.
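Evidence recency is an input to that judgment, not a rule. A sketch of how it might be computed follows, assuming the `evidence/<dimension>/YYYY-MM-DD.json` layout; `evidence_age_days` is a hypothetical helper.

```python
from datetime import date
from pathlib import Path
from typing import Dict, Optional


def evidence_age_days(root: Path, today: date) -> Dict[str, Optional[int]]:
    """Days since the newest evidence file per dimension.

    Returns None for a dimension directory that has no dated files yet.
    """
    ages: Dict[str, Optional[int]] = {}
    for dim_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        dates = []
        for f in dim_dir.glob("*.json"):
            try:
                dates.append(date.fromisoformat(f.stem))
            except ValueError:
                continue  # skip files not named YYYY-MM-DD.json
        ages[dim_dir.name] = (today - max(dates)).days if dates else None
    return ages
```

The planner could read a result such as `{"user-test": 38, "resources": 0, "funnel": None}` alongside current box load and decide for itself what is worth running.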

What Disinto Is Not

Disinto is not a general-purpose company operating system. It does not model arbitrary resources or business processes.

It is finely tuned for one thing: money → software product → customer contact → knowledge → product improvement → market fit → more money.

Every agent, process, and evidence type serves this loop.

harb #973 — Evidence directory structure
harb #974 — Red-team attack vector auto-promotion
harb #975 — run-evolution process
harb #976 — run-red-team process
harb #977 — run-holdout process
harb #978 — run-user-test process
disinto #139 — Action agent (process executor)
disinto #140 — Prediction agent (evidence reader)
disinto #142 — Planner triages predictions
