From d8244742f1977a35f44a70d0b7df33cd1190031e Mon Sep 17 00:00:00 2001 From: openhands Date: Sat, 21 Mar 2026 00:22:37 +0000 Subject: [PATCH] =?UTF-8?q?fix:=20feat:=20supervisor=20as=20formula-driven?= =?UTF-8?q?=20agent=20=E2=80=94=20cron=20+=20Matrix=20escalation=20(#245)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Claude Opus 4.6 (1M context) --- AGENTS.md | 57 +++++++-- formulas/run-supervisor.toml | 237 +++++++++++++++++++++++++++++++++++ supervisor/journal/.gitkeep | 0 supervisor/preflight.sh | 188 +++++++++++++++++++++++++++ supervisor/supervisor-run.sh | 115 +++++++++++++++++ 5 files changed, 585 insertions(+), 12 deletions(-) create mode 100644 formulas/run-supervisor.toml create mode 100644 supervisor/journal/.gitkeep create mode 100755 supervisor/preflight.sh create mode 100755 supervisor/supervisor-run.sh diff --git a/AGENTS.md b/AGENTS.md index d518752..5e512dd 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -22,7 +22,10 @@ disinto/ ├── planner/ planner-run.sh — direct cron executor for run-planner formula │ planner/journal/ — daily raw logs from each planner run │ prediction-poll.sh, prediction-agent.sh — legacy predictor (superseded by predictor/) -├── supervisor/ supervisor-poll.sh — health monitoring +├── supervisor/ supervisor-run.sh — formula-driven health monitoring (cron wrapper) +│ preflight.sh — pre-flight data collection for supervisor formula +│ supervisor/journal/ — daily health logs from each run +│ supervisor-poll.sh — legacy bash orchestrator (superseded) ├── vault/ vault-poll.sh, vault-agent.sh, vault-fire.sh — action gating ├── action/ action-poll.sh, action-agent.sh — operational task execution ├── lib/ env.sh, agent-session.sh, ci-helpers.sh, ci-debug.sh, load-project.sh, parse-deps.sh, matrix_listener.sh @@ -138,26 +141,56 @@ issue is filed against). ### Supervisor (`supervisor/`) -**Role**: Health monitoring and auto-remediation. Two-layer architecture: -(1) factory infrastructure checks (RAM, disk, swap, docker, stale processes) -that run once, and (2) per-project checks (CI, PRs, dev-agent health, -circular deps, stale deps) that iterate over `projects/*.toml`. +**Role**: Health monitoring and auto-remediation, executed as a formula-driven +Claude agent. Collects system and project metrics via a bash pre-flight script, +then runs an interactive Claude session (sonnet) that assesses health, auto-fixes +issues, escalates via Matrix, and writes a daily journal. -**Trigger**: `supervisor-poll.sh` runs every 10 min via cron. +**Trigger**: `supervisor-run.sh` runs every 20 min via cron. It creates a tmux +session with `claude --model sonnet`, injects `formulas/run-supervisor.toml` +with pre-collected metrics as context, monitors the phase file, and cleans up +on completion or timeout (20 min max session). No action issues — the supervisor +runs directly from cron like the planner and predictor. **Key files**: -- `supervisor/supervisor-poll.sh` — All checks + auto-fixes (kill stale processes, rotate logs, drop caches, docker prune, abort stale rebases) then invokes `claude -p` for unresolved alerts -- `supervisor/update-prompt.sh` — Updates the supervisor prompt file -- `supervisor/PROMPT.md` — System prompt for the supervisor's Claude invocation +- `supervisor/supervisor-run.sh` — Cron wrapper + orchestrator: lock, memory guard, + runs preflight.sh, sources disinto project config, creates tmux session, injects + formula prompt with metrics, monitors phase file, handles crash recovery via + `run_formula_and_monitor` +- `supervisor/preflight.sh` — Data collection: system resources (RAM, disk, swap, + load), Docker status, active tmux sessions + phase files, lock files, agent log + tails, CI pipeline status, open PRs, issue counts, stale worktrees, pending + escalations, Matrix escalation replies +- `formulas/run-supervisor.toml` — Execution spec: five steps (preflight review, + health-assessment, decide-actions, report, journal) with `needs` dependencies. + Claude evaluates all metrics and takes actions in a single interactive session +- `supervisor/journal/*.md` — Daily health logs from each supervisor run (local, + committed periodically) +- `supervisor/PROMPT.md` — Best-practices reference for remediation actions +- `supervisor/best-practices/*.md` — Domain-specific remediation guides (memory, + disk, CI, git, dev-agent, review-agent, codeberg) +- `supervisor/supervisor-poll.sh` — Legacy bash orchestrator (superseded by + supervisor-run.sh + formula) **Alert priorities**: P0 (memory crisis), P1 (disk), P2 (factory stopped/stalled), P3 (degraded PRs, circular deps, stale deps), P4 (housekeeping). +**Matrix integration**: The supervisor has its own Matrix thread. Posts health +summaries when there are changes, escalates P0-P2 issues, and processes replies +from humans ("ignore disk warning", "kill that agent", "what's stuck?"). The +Matrix listener routes thread replies to `/tmp/supervisor-escalation-reply`, +which `supervisor-run.sh` consumes atomically on each run. + **Environment variables consumed**: -- All from `lib/env.sh` + per-project TOML overrides +- `CODEBERG_TOKEN`, `CODEBERG_REPO`, `CODEBERG_API`, `PROJECT_NAME`, `PROJECT_REPO_ROOT` +- `PRIMARY_BRANCH`, `CLAUDE_MODEL` (set to sonnet by supervisor-run.sh) - `WOODPECKER_TOKEN`, `WOODPECKER_SERVER`, `WOODPECKER_DB_PASSWORD`, `WOODPECKER_DB_USER`, `WOODPECKER_DB_HOST`, `WOODPECKER_DB_NAME` — CI database queries -- `CHECK_PRS`, `CHECK_DEV_AGENT`, `CHECK_PIPELINE_STALL` — Per-project monitoring toggles (from TOML `[monitoring]` section) -- `CHECK_INFRA_RETRY` — Infra failure retry toggle (env var only, defaults to `true`; not configurable via project TOML) +- `MATRIX_TOKEN`, `MATRIX_ROOM_ID`, `MATRIX_HOMESERVER` — Matrix notifications + human input + +**Lifecycle**: supervisor-run.sh (cron */20) → lock + memory guard → run +preflight.sh (collect metrics) → consume escalation replies → load formula + +context → create tmux session → Claude assesses health, auto-fixes, posts +Matrix summary, writes journal → `PHASE:done`. ### Planner (`planner/`) diff --git a/formulas/run-supervisor.toml b/formulas/run-supervisor.toml new file mode 100644 index 0000000..033f4a1 --- /dev/null +++ b/formulas/run-supervisor.toml @@ -0,0 +1,237 @@ +# formulas/run-supervisor.toml — Supervisor formula (health monitoring + remediation) +# +# Executed by supervisor/supervisor-run.sh via cron (every 20 minutes). +# supervisor-run.sh creates a tmux session with Claude (sonnet) and injects +# this formula with pre-collected metrics as context. +# +# Steps: preflight → health-assessment → decide-actions → report → journal +# +# Key differences from planner/gardener: +# - Runs every 20min — lightweight health check +# - Primarily READS state, rarely WRITES (no PRs, just Matrix + journal) +# - Reactive to escalations — processes pending escalation events +# - Conversation memory via Matrix thread and journal + +name = "run-supervisor" +description = "Factory health monitoring: assess metrics, fix issues, report via Matrix, write journal" +version = 1 +model = "sonnet" + +[context] +files = ["AGENTS.md"] + +[[steps]] +id = "preflight" +title = "Review pre-collected metrics" +description = """ +The pre-flight metrics have already been collected by supervisor/preflight.sh +and injected into your prompt above. Review them now. + +1. Read the injected metrics data carefully (System Resources, Docker, + Active Sessions, Phase Files, Lock Files, Agent Logs, CI Pipelines, + Open PRs, Issue Status, Stale Worktrees, Pending Escalations, + Escalation Replies). + +2. If there are escalation replies from Matrix (human messages), note them — + you will act on them in the decide-actions step. + +3. Read the supervisor journal for recent history: + JOURNAL_FILE="$FACTORY_ROOT/supervisor/journal/$(date -u +%Y-%m-%d).md" + if [ -f "$JOURNAL_FILE" ]; then cat "$JOURNAL_FILE"; fi + +4. Note any values that cross these thresholds: + - RAM available < 500MB or swap > 3GB → P0 (memory crisis) + - Disk > 80% → P1 (disk pressure) + - Agent sessions dead, CI stuck/pending, git in bad state → P2 (factory stopped) + - PRs stale, unreviewed, or with merge conflicts → P3 (factory degraded) + - Stale worktrees, old lock files → P4 (housekeeping) +""" + +[[steps]] +id = "health-assessment" +title = "Evaluate health of each subsystem" +description = """ +Categorize every finding from the metrics into priority levels. + +### P0 — Memory crisis +- RAM available < 500MB +- Swap used > 3GB AND RAM available < 2000MB + +### P1 — Disk pressure +- Disk usage > 80% + +### P2 — Factory stopped / stalled +- CI pipelines stuck running > 20min or pending > 30min +- Dev-agent lock file present but process dead +- Dev-agent status unchanged for > 30min +- Git repo on wrong branch or in broken rebase state +- Pipeline stalled: backlog issues exist but no agent ran for > 20min +- Dev-agent blocked: last N polls all report "no ready issues" +- Dev sessions in PHASE:needs_human for > 24h + +### P3 — Factory degraded +- PRs with CI pass but merge conflict (needs rebase) +- PRs with CI failure stale > 30min +- PRs with CI pass but no review for > 60min +- Circular dependency deadlocks in backlog +- Stale dependencies (blocked by issues open > 30 days) + +### P4 — Housekeeping +- Stale worktrees > 2h old with no active process +- Lock files for dead processes +- Stale claude processes (> 3h old) + +List each finding with its priority level. If everything looks healthy, +note "All systems healthy" and proceed. +""" +needs = ["preflight"] + +[[steps]] +id = "decide-actions" +title = "Fix what you can, escalate what you cannot" +description = """ +For each finding from the health assessment, decide and execute an action. + +### Auto-fixable (execute these directly) + +**P0 Memory crisis:** + # Kill stale one-shot claude processes (>3h old) + pgrep -f "claude -p" --older 10800 2>/dev/null | xargs kill 2>/dev/null || true + # Drop filesystem caches + sync && echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null 2>&1 || true + +**P1 Disk pressure:** + # Docker cleanup + sudo docker system prune -f >/dev/null 2>&1 || true + # Truncate logs > 10MB + for f in "$FACTORY_ROOT"/{dev,review,supervisor,gardener,planner,predictor}/*.log; do + [ -f "$f" ] && [ "$(du -k "$f" | cut -f1)" -gt 10240 ] && truncate -s 0 "$f" + done + +**P2 Dead lock files:** + rm -f /path/to/stale.lock + +**P2 Stale rebase:** + cd "$PROJECT_REPO_ROOT" + git rebase --abort 2>/dev/null + git checkout "$PRIMARY_BRANCH" 2>/dev/null + +**P2 Wrong branch:** + cd "$PROJECT_REPO_ROOT" + git checkout "$PRIMARY_BRANCH" 2>/dev/null + +**P4 Stale worktrees:** + git -C "$PROJECT_REPO_ROOT" worktree remove --force /tmp/stale-worktree 2>/dev/null + git -C "$PROJECT_REPO_ROOT" worktree prune 2>/dev/null + +**P4 Stale claude processes:** + pgrep -f "claude -p" --older 10800 2>/dev/null | xargs kill 2>/dev/null || true + +### Escalation replies (from Matrix) + +If there are escalation replies from a human, act on them: +- "ignore X" → note in journal, do not alert on X this run +- "kill that agent" → identify and kill the referenced session +- "what's stuck?" → include detailed status in the Matrix report +- Other instructions → follow them, use best judgment + +### Cannot auto-fix → escalate + +For P0-P2 issues that persist after auto-fix attempts, or issues requiring +human judgment, prepare an escalation message for the report step. + +Read the relevant best-practices file before taking action: + cat "$FACTORY_ROOT/supervisor/best-practices/memory.md" # P0 + cat "$FACTORY_ROOT/supervisor/best-practices/disk.md" # P1 + cat "$FACTORY_ROOT/supervisor/best-practices/ci.md" # P2 CI + cat "$FACTORY_ROOT/supervisor/best-practices/dev-agent.md" # P2 agent + cat "$FACTORY_ROOT/supervisor/best-practices/git.md" # P2 git + +Track what you fixed and what needs escalation for the report step. +""" +needs = ["health-assessment"] + +[[steps]] +id = "report" +title = "Post health summary to Matrix" +description = """ +Post a status summary to Matrix. Use the matrix_send function: + source "$FACTORY_ROOT/lib/env.sh" + matrix_send "supervisor" "" + +### When everything is healthy +Post a brief "all clear" only if the PREVIOUS run had alerts (check journal). +Do NOT post "all clear" every 20 minutes — that would be noise. + +### When there are findings +Post a summary grouped by priority: + matrix_send "supervisor" "Supervisor health check: + + Fixed: + - + + Alerts: + - [P2] + - [P3] + + Status: RAM=MB Disk=% Load=" + +### When escalation is needed (P0-P2 unresolved) +Escalate with a clear call to action: + matrix_send "supervisor" "ESCALATE: + + Suggested action: " + +### Responding to escalation replies +If you acted on a human's reply, confirm what you did: + matrix_send "supervisor" "Acted on your reply: " + +Keep messages concise. Do not post identical messages to what was posted +in the previous run (check journal for prior messages). +""" +needs = ["decide-actions"] + +[[steps]] +id = "journal" +title = "Write health journal entry" +description = """ +Append a timestamped entry to the supervisor journal. + +File path: + $FACTORY_ROOT/supervisor/journal/$(date -u +%Y-%m-%d).md + +If the file already exists (multiple runs per day), append a new section. +If it does not exist, create it. + +Format: + ## Supervisor run — HH:MM UTC + + ### Health status + - RAM: MB available, Swap: MB + - Disk: % + - Load: + - Docker: containers + + ### Findings + - [P] + (or "No issues found — all systems healthy") + + ### Actions taken + - + (or "No actions needed") + + ### Escalation replies processed + - + (or "None") + +Keep each entry concise — 15-25 lines max. This journal provides +run-to-run context so future supervisor runs can detect trends +(e.g., "disk has been >75% for 3 consecutive runs"). + +IMPORTANT: Do NOT commit or push the journal — it is a local working file. +The journal directory is committed to git periodically by other agents. + +After writing the journal, write the phase signal: + echo 'PHASE:done' > '$PHASE_FILE' +""" +needs = ["report"] diff --git a/supervisor/journal/.gitkeep b/supervisor/journal/.gitkeep new file mode 100644 index 0000000..e69de29 diff --git a/supervisor/preflight.sh b/supervisor/preflight.sh new file mode 100755 index 0000000..ee3f070 --- /dev/null +++ b/supervisor/preflight.sh @@ -0,0 +1,188 @@ +#!/usr/bin/env bash +# ============================================================================= +# preflight.sh — Collect system and project metrics for the supervisor formula +# +# Outputs structured text to stdout. Called by supervisor-run.sh before +# launching the Claude session. The output is injected into the prompt. +# +# Usage: +# bash supervisor/preflight.sh [projects/disinto.toml] +# ============================================================================= +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +FACTORY_ROOT="$(dirname "$SCRIPT_DIR")" + +export PROJECT_TOML="${1:-$FACTORY_ROOT/projects/disinto.toml}" +# shellcheck source=../lib/env.sh +source "$FACTORY_ROOT/lib/env.sh" +# shellcheck source=../lib/ci-helpers.sh +source "$FACTORY_ROOT/lib/ci-helpers.sh" + +# ── System Resources ───────────────────────────────────────────────────── + +echo "## System Resources" + +_avail_mb=$(free -m | awk '/Mem:/{print $7}') +_total_mb=$(free -m | awk '/Mem:/{print $2}') +_swap_used=$(free -m | awk '/Swap:/{print $3}') +_disk_pct=$(df -h / | awk 'NR==2{print $5}' | tr -d '%') +_disk_used=$(df -h / | awk 'NR==2{print $3}') +_disk_total=$(df -h / | awk 'NR==2{print $2}') +_load=$(cat /proc/loadavg 2>/dev/null || echo "unknown") + +echo "RAM: ${_avail_mb}MB available / ${_total_mb}MB total, Swap: ${_swap_used}MB used" +echo "Disk: ${_disk_pct}% used (${_disk_used}/${_disk_total} on /)" +echo "Load: ${_load}" +echo "" + +# ── Docker ──────────────────────────────────────────────────────────────── + +echo "## Docker" +if command -v docker &>/dev/null; then + docker ps --format 'table {{.Names}}\t{{.Status}}' 2>/dev/null || echo "Docker query failed" +else + echo "Docker not available" +fi +echo "" + +# ── Active Sessions + Phase Files ───────────────────────────────────────── + +echo "## Active Sessions" +if tmux list-sessions 2>/dev/null; then + : +else + echo "No tmux sessions" +fi +echo "" + +echo "## Phase Files" +_found_phase=false +for _pf in /tmp/*-session-*.phase; do + [ -f "$_pf" ] || continue + _found_phase=true + _phase_content=$(head -1 "$_pf" 2>/dev/null || echo "unreadable") + _phase_age_min=$(( ($(date +%s) - $(stat -c %Y "$_pf" 2>/dev/null || echo 0)) / 60 )) + echo " $(basename "$_pf"): ${_phase_content} (${_phase_age_min}min ago)" +done +[ "$_found_phase" = false ] && echo " None" +echo "" + +# ── Lock Files ──────────────────────────────────────────────────────────── + +echo "## Lock Files" +_found_lock=false +for _lf in /tmp/*-poll.lock /tmp/*-run.lock /tmp/dev-agent-*.lock; do + [ -f "$_lf" ] || continue + _found_lock=true + _pid=$(cat "$_lf" 2>/dev/null || true) + _age_min=$(( ($(date +%s) - $(stat -c %Y "$_lf" 2>/dev/null || echo 0)) / 60 )) + _alive="dead" + [ -n "${_pid:-}" ] && kill -0 "$_pid" 2>/dev/null && _alive="alive" + echo " $(basename "$_lf"): PID=${_pid:-?} ${_alive} age=${_age_min}min" +done +[ "$_found_lock" = false ] && echo " None" +echo "" + +# ── Agent Logs (last 15 lines each) ────────────────────────────────────── + +echo "## Recent Agent Logs" +for _log in supervisor/supervisor.log dev/dev-agent.log review/review.log \ + gardener/gardener.log planner/planner.log predictor/predictor.log \ + action/action.log; do + _logpath="${FACTORY_ROOT}/${_log}" + if [ -f "$_logpath" ]; then + _log_age_min=$(( ($(date +%s) - $(stat -c %Y "$_logpath" 2>/dev/null || echo 0)) / 60 )) + echo "### ${_log} (last modified ${_log_age_min}min ago)" + tail -15 "$_logpath" 2>/dev/null || echo "(read failed)" + echo "" + fi +done + +# ── CI Pipelines ────────────────────────────────────────────────────────── + +echo "## CI Pipelines (${PROJECT_NAME})" + +_recent_ci=$(wpdb -A -c " + SELECT number, status, branch, + ROUND(EXTRACT(EPOCH FROM (to_timestamp(finished) - to_timestamp(started)))/60)::int as dur_min + FROM pipelines + WHERE repo_id = ${WOODPECKER_REPO_ID} + AND finished > 0 + AND to_timestamp(finished) > now() - interval '24 hours' + ORDER BY number DESC LIMIT 10;" 2>/dev/null || echo "CI database query failed") +echo "$_recent_ci" + +_stuck=$(wpdb -c " + SELECT count(*) FROM pipelines + WHERE repo_id=${WOODPECKER_REPO_ID} + AND status='running' + AND EXTRACT(EPOCH FROM now() - to_timestamp(started)) > 1200;" 2>/dev/null | xargs || echo "?") + +_pending=$(wpdb -c " + SELECT count(*) FROM pipelines + WHERE repo_id=${WOODPECKER_REPO_ID} + AND status='pending' + AND EXTRACT(EPOCH FROM now() - to_timestamp(created)) > 1800;" 2>/dev/null | xargs || echo "?") + +echo "Stuck (>20min): ${_stuck}" +echo "Pending (>30min): ${_pending}" +echo "" + +# ── Open PRs ────────────────────────────────────────────────────────────── + +echo "## Open PRs (${PROJECT_NAME})" +_open_prs=$(codeberg_api GET "/pulls?state=open&limit=10" 2>/dev/null || echo "[]") +echo "$_open_prs" | jq -r '.[] | "#\(.number) [\(.head.ref)] \(.title) — updated \(.updated_at)"' 2>/dev/null || echo "No open PRs or query failed" +echo "" + +# ── Backlog + In-Progress ───────────────────────────────────────────────── + +echo "## Issue Status (${PROJECT_NAME})" +_backlog_count=$(codeberg_api GET "/issues?state=open&labels=backlog&type=issues&limit=1" 2>/dev/null | jq 'length' 2>/dev/null || echo "?") +_in_progress_count=$(codeberg_api GET "/issues?state=open&labels=in-progress&type=issues&limit=1" 2>/dev/null | jq 'length' 2>/dev/null || echo "?") +_blocked_count=$(codeberg_api GET "/issues?state=open&labels=blocked&type=issues&limit=1" 2>/dev/null | jq 'length' 2>/dev/null || echo "?") +echo "Backlog: ${_backlog_count}, In-progress: ${_in_progress_count}, Blocked: ${_blocked_count}" +echo "" + +# ── Stale Worktrees ─────────────────────────────────────────────────────── + +echo "## Stale Worktrees" +_found_wt=false +for _wt in /tmp/*-worktree-* /tmp/*-review-*; do + [ -d "$_wt" ] || continue + _found_wt=true + _wt_age_min=$(( ($(date +%s) - $(stat -c %Y "$_wt" 2>/dev/null || echo 0)) / 60 )) + echo " $(basename "$_wt"): ${_wt_age_min}min old" +done +[ "$_found_wt" = false ] && echo " None" +echo "" + +# ── Pending Escalations ────────────────────────────────────────────────── + +echo "## Pending Escalations" +_found_esc=false +for _esc_file in "${FACTORY_ROOT}/supervisor/escalations-"*.jsonl; do + [ -f "$_esc_file" ] || continue + [[ "$_esc_file" == *.done.jsonl ]] && continue + _esc_count=$(wc -l < "$_esc_file" 2>/dev/null || echo 0) + [ "${_esc_count:-0}" -gt 0 ] || continue + _found_esc=true + echo "### $(basename "$_esc_file") (${_esc_count} entries)" + cat "$_esc_file" + echo "" +done +[ "$_found_esc" = false ] && echo " None" +echo "" + +# ── Escalation Replies from Matrix ──────────────────────────────────────── + +echo "## Escalation Replies (from Matrix)" +if [ -s /tmp/supervisor-escalation-reply ]; then + cat /tmp/supervisor-escalation-reply + echo "" + echo "(This reply will be consumed after this run)" +else + echo " None" +fi +echo "" diff --git a/supervisor/supervisor-run.sh b/supervisor/supervisor-run.sh new file mode 100755 index 0000000..26494c5 --- /dev/null +++ b/supervisor/supervisor-run.sh @@ -0,0 +1,115 @@ +#!/usr/bin/env bash +# ============================================================================= +# supervisor-run.sh — Cron wrapper: supervisor execution via Claude + formula +# +# Runs every 20 minutes (or on-demand). Guards against concurrent runs and +# low memory. Collects metrics via preflight.sh, then creates a tmux session +# with Claude (sonnet) reading formulas/run-supervisor.toml. +# +# Replaces supervisor-poll.sh (bash orchestrator + claude -p one-shot) with +# formula-driven interactive Claude session matching the planner/predictor +# pattern. +# +# Usage: +# supervisor-run.sh [projects/disinto.toml] # project config (default: disinto) +# +# Cron: */20 * * * * cd /path/to/dark-factory && bash supervisor/supervisor-run.sh +# ============================================================================= +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +FACTORY_ROOT="$(dirname "$SCRIPT_DIR")" + +# Accept project config from argument; default to disinto +export PROJECT_TOML="${1:-$FACTORY_ROOT/projects/disinto.toml}" +# shellcheck source=../lib/env.sh +source "$FACTORY_ROOT/lib/env.sh" +# shellcheck source=../lib/agent-session.sh +source "$FACTORY_ROOT/lib/agent-session.sh" +# shellcheck source=../lib/formula-session.sh +source "$FACTORY_ROOT/lib/formula-session.sh" + +LOG_FILE="$SCRIPT_DIR/supervisor.log" +# shellcheck disable=SC2034 # consumed by run_formula_and_monitor +SESSION_NAME="supervisor-${PROJECT_NAME}" +PHASE_FILE="/tmp/supervisor-session-${PROJECT_NAME}.phase" + +# shellcheck disable=SC2034 # read by monitor_phase_loop in lib/agent-session.sh +PHASE_POLL_INTERVAL=15 + +SCRATCH_FILE="/tmp/supervisor-${PROJECT_NAME}-scratch.md" + +log() { echo "[$(date -u +%Y-%m-%dT%H:%M:%S)Z] $*" >> "$LOG_FILE"; } + +# ── Guards ──────────────────────────────────────────────────────────────── +acquire_cron_lock "/tmp/supervisor-run.lock" +check_memory 2000 + +log "--- Supervisor run start ---" + +# ── Collect pre-flight metrics ──────────────────────────────────────────── +log "Running preflight.sh" +PREFLIGHT_OUTPUT="" +if PREFLIGHT_OUTPUT=$(bash "$SCRIPT_DIR/preflight.sh" "$PROJECT_TOML" 2>&1); then + log "Preflight collected ($(echo "$PREFLIGHT_OUTPUT" | wc -l) lines)" +else + log "WARNING: preflight.sh failed, continuing with partial data" +fi + +# ── Consume escalation replies ──────────────────────────────────────────── +# Move the file atomically so matrix_listener can write a new one +ESCALATION_REPLY="" +if [ -s /tmp/supervisor-escalation-reply ]; then + _reply_tmp="/tmp/supervisor-escalation-reply.consumed.$$" + if mv /tmp/supervisor-escalation-reply "$_reply_tmp" 2>/dev/null; then + ESCALATION_REPLY=$(cat "$_reply_tmp") + rm -f "$_reply_tmp" + log "Consumed escalation reply: $(echo "$ESCALATION_REPLY" | head -1)" + fi +fi + +# ── Load formula + context ─────────────────────────────────────────────── +load_formula "$FACTORY_ROOT/formulas/run-supervisor.toml" +build_context_block AGENTS.md + +# ── Read scratch file (compaction survival) ─────────────────────────────── +SCRATCH_CONTEXT=$(read_scratch_context "$SCRATCH_FILE") +SCRATCH_INSTRUCTION=$(build_scratch_instruction "$SCRATCH_FILE") + +# ── Build prompt ───────────────────────────────────────────────────────── +build_prompt_footer + +# shellcheck disable=SC2034 # consumed by run_formula_and_monitor +PROMPT="You are the supervisor agent for ${CODEBERG_REPO}. Work through the formula below. You MUST write PHASE:done to '${PHASE_FILE}' when finished — the orchestrator will time you out if you return to the prompt without signalling. + +You have full shell access and --dangerously-skip-permissions. +Fix what you can. Escalate what you cannot. Do NOT ask permission — act first, report after. + +## Pre-flight metrics (collected $(date -u +%H:%M) UTC) +${PREFLIGHT_OUTPUT} +${ESCALATION_REPLY:+ +## Escalation Reply (from Matrix — human message) +${ESCALATION_REPLY} + +Act on this reply in the decide-actions step. +} +## Project context +${CONTEXT_BLOCK} +${SCRATCH_CONTEXT:+${SCRATCH_CONTEXT} +} +## Formula +${FORMULA_CONTENT} + +${SCRATCH_INSTRUCTION} + +${PROMPT_FOOTER}" + +# ── Run session ────────────────────────────────────────────────────────── +export CLAUDE_MODEL="sonnet" +run_formula_and_monitor "supervisor" 1200 + +# ── Cleanup scratch file on normal exit ────────────────────────────────── +# FINAL_PHASE already set by run_formula_and_monitor +if [ "${FINAL_PHASE:-}" = "PHASE:done" ]; then + rm -f "$SCRATCH_FILE" +fi