<!-- last-reviewed: c363ee0aea2ae447daab28c2c850d6abefc8c6b5 -->
# Supervisor Agent
**Role**: Health monitoring and auto-remediation, executed as a formula-driven
Claude agent. Collects system and project metrics via a bash preflight script,
then runs a single Claude session (sonnet, via `claude -p`) that assesses health, auto-fixes
issues, and writes a daily journal. When blocked on external
resources or human decisions, files vault items instead of escalating directly.
**Trigger**: `supervisor-run.sh` is invoked by two polling loops:
- **Agents container** (`docker/agents/entrypoint.sh`): every `SUPERVISOR_INTERVAL` seconds (default 1200 = 20 min). Controlled by the `supervisor` role in `AGENT_ROLES` (included in the default seven-role set since P1/#801). Logs to `supervisor.log` in the agents container.
- **Edge container** (`docker/edge/entrypoint-edge.sh`): a separate loop (lines 169-172) that runs independently of the agents container's polling schedule.
Both loops invoke the same `supervisor-run.sh`. It sources `lib/guard.sh` and first calls `check_active supervisor`, skipping the run if `$FACTORY_ROOT/state/.supervisor-active` is absent. It then loads `formulas/run-supervisor.toml` with the pre-collected metrics as context, runs `claude -p` via `agent-sdk.sh`, and cleans up on completion or timeout.
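The gating and polling described above can be sketched as follows. This is a simplified stand-in, not the code in `lib/guard.sh` or `docker/agents/entrypoint.sh`: it assumes `check_active` only tests for the state file (the real guard may also log or exit), and `supervisor_poll_loop`, `supervisor_run_guard`, and the `$FACTORY_ROOT/supervisor/supervisor-run.sh` path are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: simplified stand-ins for the guard in lib/guard.sh and the
# agents-container polling loop. Function names and paths are assumptions.

# Succeeds only when the role's activation marker exists,
# e.g. $FACTORY_ROOT/state/.supervisor-active for "supervisor".
check_active() {
  local role=$1
  [ -f "${FACTORY_ROOT:?FACTORY_ROOT must be set}/state/.${role}-active" ]
}

# Hypothetical loop: run once, then sleep SUPERVISOR_INTERVAL seconds
# (default 1200 = 20 min). A failed run is reported but does not stop the loop.
supervisor_poll_loop() {
  while true; do
    "$FACTORY_ROOT/supervisor/supervisor-run.sh" ||
      echo "supervisor run failed (exit $?)" >&2
    sleep "${SUPERVISOR_INTERVAL:-1200}"
  done
}

# The guard comes first in a run; an inactive role is a clean no-op.
supervisor_run_guard() {
  check_active supervisor || { echo "supervisor inactive, skipping"; return 0; }
  echo "supervisor active, continuing"
}
```

Treating an inactive role as a successful no-op keeps the polling loop quiet when the supervisor is deliberately switched off.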
**Key files**:
- `supervisor/supervisor-run.sh` — Polling-loop entry point and orchestrator: lock, memory guard,
runs `preflight.sh`, sources the disinto project config, injects the formula prompt with metrics,
runs `claude -p` via `agent-sdk.sh`, handles crash recovery
2026-03-21 12:44:23 +00:00
- `supervisor/preflight.sh` — Data collection: system resources (RAM, disk, swap,
load), Docker status, active sessions + phase files, lock files, agent log
tails, CI pipeline status, open PRs, issue counts, stale worktrees, blocked
issues. Also performs **stale phase cleanup**: scans `/tmp/*-session-*.phase`
files for `PHASE:escalate` entries and auto-removes any whose linked issue
is confirmed closed (with a 24h grace period after closure to avoid races). Reports
**stale crashed worktrees** (worktrees preserved after a crash); supervisor
housekeeping removes them after 24h
- `formulas/run-supervisor.toml` — Execution spec: five steps (preflight review,
health-assessment, decide-actions, report, journal) with `needs` dependencies.
Claude evaluates all metrics and takes actions in a single session
- `$OPS_REPO_ROOT/knowledge/*.md` — Domain-specific remediation guides (memory,
disk, CI, git, dev-agent, review-agent, forge)
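The stale phase cleanup performed by `preflight.sh` can be sketched roughly as below. This is a hedged illustration, not the script's code. It assumes that the linked issue number is the trailing number in the phase file name. It also relies on a hypothetical `issue_closed_at` helper, stubbed here, that prints the issue's closure time in epoch seconds, or nothing while the issue is open. A real version would query the forge API instead.

```bash
#!/usr/bin/env bash
# Sketch of stale phase cleanup. Assumptions (not taken from preflight.sh):
# - the linked issue number is the trailing number in the file name,
#   e.g. /tmp/dev-session-myproj-42.phase -> issue 42
# - issue_closed_at is a hypothetical helper printing the closure time in
#   epoch seconds, or nothing if the issue is still open.
GRACE_SECONDS=86400  # 24h grace period after closure, to avoid races

issue_closed_at() { :; }  # stub: treats every issue as still open

# Succeeds when an issue closed at $1 (epoch s) is past the grace period at $2.
past_grace() {
  local closed_at=$1 now=$2
  [ -n "$closed_at" ] && [ $((now - closed_at)) -ge "$GRACE_SECONDS" ]
}

cleanup_stale_phases() {
  local dir=${1:-/tmp} now f issue closed_at
  now=$(date +%s)
  for f in "$dir"/*-session-*.phase; do
    [ -e "$f" ] || continue                    # glob matched nothing
    grep -q 'PHASE:escalate' "$f" || continue  # only escalated sessions
    issue=$(basename "$f" .phase)
    issue=${issue##*-}                         # trailing number = issue
    case $issue in ''|*[!0-9]*) continue ;; esac
    closed_at=$(issue_closed_at "$issue")
    if past_grace "$closed_at" "$now"; then
      rm -f -- "$f"
      echo "removed $f (issue #$issue closed >24h ago)"
    fi
  done
}
```

Open issues and recently closed ones are left alone, so a session that is still being resolved never loses its phase file.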
**Alert priorities**: P0 (memory crisis), P1 (disk), P2 (factory stopped/stalled),
P3 (degraded PRs, circular deps, stale deps), P4 (housekeeping).
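Because a lower P-number means higher urgency, findings can be handled most-urgent first with a plain sort. The minimal sketch below assumes each finding is a hypothetical `P<n> <message>` line. The formula itself leaves triage to Claude, so this only shows what the ordering implies.

```bash
#!/usr/bin/env bash
# Sketch: order supervisor findings most-urgent first.
# Assumed input format (hypothetical): one "P<n> <message>" line per alert.
triage_alerts() {
  # Sort on the first field only; -s keeps input order within a priority.
  sort -s -k1,1
}

# Prints the single most urgent alert, or nothing if there are none.
top_alert() {
  triage_alerts | head -n 1
}
```

Usage: `printf 'P3 stale dep\nP0 memory crisis\n' | top_alert` selects the P0 line.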
**Environment variables consumed**:
- `FORGE_TOKEN`, `FORGE_SUPERVISOR_TOKEN` (falls back to `FORGE_TOKEN`), `FORGE_REPO`, `FORGE_API`, `PROJECT_NAME`, `PROJECT_REPO_ROOT`, `OPS_REPO_ROOT`
- `PRIMARY_BRANCH`, `CLAUDE_MODEL` (set to sonnet by `supervisor-run.sh`)
- `SUPERVISOR_INTERVAL` — polling interval in seconds for the agents container (default 1200 = 20 min)
- `WOODPECKER_TOKEN`, `WOODPECKER_SERVER`, `WOODPECKER_DB_PASSWORD`, `WOODPECKER_DB_USER`, `WOODPECKER_DB_HOST`, `WOODPECKER_DB_NAME` — Woodpecker CI API and database queries
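The token fallback and defaults listed above might be resolved roughly as follows. This is a sketch under assumptions: `resolve_supervisor_env` and `require_env` are hypothetical helpers. The code also assumes that the fallback is implemented by overwriting `FORGE_TOKEN` with the supervisor-specific token when one is set.

```bash
#!/usr/bin/env bash
# Sketch: resolve supervisor environment. Helper names are hypothetical.

resolve_supervisor_env() {
  # Prefer the supervisor's own forge token; fall back to FORGE_TOKEN.
  FORGE_TOKEN="${FORGE_SUPERVISOR_TOKEN:-${FORGE_TOKEN:-}}"
  CLAUDE_MODEL="sonnet"                               # pinned for the supervisor
  SUPERVISOR_INTERVAL="${SUPERVISOR_INTERVAL:-1200}"  # seconds (20 min)
  export FORGE_TOKEN CLAUDE_MODEL SUPERVISOR_INTERVAL
}

# Fails (and lists the names) if any of the given variables is empty or unset.
require_env() {
  local missing=() v
  for v in "$@"; do
    [ -n "${!v:-}" ] || missing+=("$v")   # ${!v} = value of the variable named $v
  done
  if [ "${#missing[@]}" -gt 0 ]; then
    echo "missing env: ${missing[*]}" >&2
    return 1
  fi
}
```

Usage: `resolve_supervisor_env && require_env FORGE_TOKEN FORGE_API FORGE_REPO || exit 1` fails fast with the full list of missing names instead of stopping at the first one.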
**Degraded mode (Issue #544)**: When `OPS_REPO_ROOT` is not set or the directory doesn't exist, the supervisor runs in degraded mode:
- Uses bundled knowledge files from `$FACTORY_ROOT/knowledge/` instead of ops repo playbooks
- Writes journal locally to `$FACTORY_ROOT/state/supervisor-journal/` (not committed to git)
- Files vault items locally to `$PROJECT_REPO_ROOT/vault/pending/`
- Logs a WARNING message at startup indicating degraded mode
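The degraded-mode switch can be sketched as a single path-selection step. The variable names `OPS_REPO_DEGRADED`, `OPS_KNOWLEDGE_ROOT`, `OPS_JOURNAL_ROOT`, and `OPS_VAULT_ROOT` follow the #544 change. The function name `select_ops_paths` and the ops-repo subpaths for the journal and vault are assumptions. Only the degraded-mode paths are taken from the list above.

```bash
#!/usr/bin/env bash
# Sketch of degraded-mode path selection. The ops-repo journal/vault subpaths
# are assumptions; the degraded paths follow the documented behavior.
select_ops_paths() {
  if [ -n "${OPS_REPO_ROOT:-}" ] && [ -d "$OPS_REPO_ROOT" ]; then
    OPS_REPO_DEGRADED=0
    OPS_KNOWLEDGE_ROOT="$OPS_REPO_ROOT/knowledge"
    OPS_JOURNAL_ROOT="$OPS_REPO_ROOT/journal/supervisor"  # assumed layout
    OPS_VAULT_ROOT="$OPS_REPO_ROOT/vault/pending"         # assumed layout
  else
    OPS_REPO_DEGRADED=1
    echo "WARNING: ops repo not found (OPS_REPO_ROOT='${OPS_REPO_ROOT:-}'); running in degraded mode" >&2
    OPS_KNOWLEDGE_ROOT="$FACTORY_ROOT/knowledge"               # bundled guides
    OPS_JOURNAL_ROOT="$FACTORY_ROOT/state/supervisor-journal"  # local, not committed
    OPS_VAULT_ROOT="$PROJECT_REPO_ROOT/vault/pending"          # local vault items
  fi
  export OPS_REPO_DEGRADED OPS_KNOWLEDGE_ROOT OPS_JOURNAL_ROOT OPS_VAULT_ROOT
}
```

Exporting a single flag plus resolved roots lets the formula read one set of paths regardless of mode, while the WARNING makes the fallback visible in the log.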
**Lifecycle**: `supervisor-run.sh` (invoked by a polling loop, every 20 min by default in the agents container; gated by `check_active supervisor`)
→ lock + memory guard → run `preflight.sh` (collect metrics) → load formula + context → run
`claude -p` via `agent-sdk.sh` → Claude assesses health, auto-fixes, writes journal → `PHASE:done`.
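The "lock + memory guard" stage of the lifecycle can be sketched as below. The lock path, the 2000 MiB threshold, and the helper names are hypothetical values chosen for illustration, not taken from `supervisor-run.sh`. The lock uses `mkdir`, which is atomic, so overlapping runs (for example from the agents and edge loops) cannot both proceed.

```bash
#!/usr/bin/env bash
# Sketch of the lock + memory guard step. Lock path, threshold, and helper
# names are hypothetical.
SUPERVISOR_LOCK=${SUPERVISOR_LOCK:-/tmp/supervisor-run.lock}
MIN_AVAILABLE_MB=${MIN_AVAILABLE_MB:-2000}

# mkdir either creates the directory or fails, atomically.
acquire_lock() { mkdir "$SUPERVISOR_LOCK" 2>/dev/null; }
release_lock() { rmdir "$SUPERVISOR_LOCK" 2>/dev/null || true; }

# Prints MemAvailable in MiB from a meminfo-format file (default /proc/meminfo).
available_mb() {
  awk '/^MemAvailable:/ { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}

# Succeeds when at least MIN_AVAILABLE_MB MiB are available.
memory_ok() {
  local mb
  mb=$(available_mb "${1:-/proc/meminfo}")
  [ -n "$mb" ] && [ "$mb" -ge "$MIN_AVAILABLE_MB" ]
}

# Runs "$@" (e.g. preflight, formula load, claude -p) only if both guards pass.
run_guarded() {
  acquire_lock || { echo "lock held, skipping"; return 0; }
  local rc=0
  if memory_ok "${MEMINFO:-/proc/meminfo}"; then
    "$@" || rc=$?
  else
    echo "low memory, skipping"
  fi
  release_lock
  return "$rc"
}
```

This sketch does not clean up after a crash mid-run, which would leave the lock directory behind. A real script would also handle that case, for example with a `trap` or a stale-lock check, in line with the crash recovery `supervisor-run.sh` performs.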