disinto/formulas/run-supervisor.toml
openhands b60811f0a1 fix: supervisor should clean up stale PHASE:escalate files for closed issues (#664)
Add auto-cleanup to supervisor/preflight.sh: PHASE:escalate files whose
parent issue/PR is confirmed closed (via Forge API) are deleted after a
24h grace period. Cleanup results appear in the preflight output for
journal logging by the supervisor formula.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 08:34:22 +00:00

257 lines
9.2 KiB
TOML

# formulas/run-supervisor.toml — Supervisor formula (health monitoring + remediation)
#
# Executed by supervisor/supervisor-run.sh via cron (every 20 minutes).
# supervisor-run.sh creates a tmux session with Claude (sonnet) and injects
# this formula with pre-collected metrics as context.
#
# Steps: preflight → health-assessment → decide-actions → report → journal
#
# Key differences from planner/gardener:
# - Runs every 20min — lightweight health check
# - Primarily READS state, rarely WRITES (no PRs, just Matrix + journal)
# - Reactive to escalations — processes pending escalation events
# - Conversation memory via Matrix thread and journal
name = "run-supervisor"
description = "Factory health monitoring: assess metrics, fix issues, report via Matrix, write journal"
version = 1
model = "sonnet"
[context]
files = ["AGENTS.md"]
[[steps]]
id = "preflight"
title = "Review pre-collected metrics"
description = """
The pre-flight metrics have already been collected by supervisor/preflight.sh
and injected into your prompt above. Review them now.
1. Read the injected metrics data carefully (System Resources, Docker,
Active Sessions, Phase Files, Stale Phase Cleanup, Lock Files, Agent Logs,
CI Pipelines, Open PRs, Issue Status, Stale Worktrees, Pending Escalations,
Escalation Replies).
Note: preflight.sh auto-removes PHASE:escalate files for closed issues
(24h grace period). Check the "Stale Phase Cleanup" section for any
files cleaned or in grace period this run.
2. If there are escalation replies from Matrix (human messages), note them —
you will act on them in the decide-actions step.
3. Read the supervisor journal for recent history:
JOURNAL_FILE="$FACTORY_ROOT/supervisor/journal/$(date -u +%Y-%m-%d).md"
if [ -f "$JOURNAL_FILE" ]; then cat "$JOURNAL_FILE"; fi
4. Note any values that cross these thresholds:
- RAM available < 500MB or swap > 3GB → P0 (memory crisis)
- Disk > 80% → P1 (disk pressure)
- Agent sessions dead, CI stuck/pending, git in bad state → P2 (factory stopped)
- PRs stale >20min (CI done, no push since) → P3 (factory degraded)
- Stale worktrees, old lock files → P4 (housekeeping)
"""
[[steps]]
id = "health-assessment"
title = "Evaluate health of each subsystem"
description = """
Categorize every finding from the metrics into priority levels.
### P0 — Memory crisis
- RAM available < 500MB
- Swap used > 3GB AND RAM available < 2000MB
### P1 — Disk pressure
- Disk usage > 80%
### P2 — Factory stopped / stalled
- CI pipelines stuck running > 20min or pending > 30min
- Dev-agent lock file present but process dead
- Dev-agent status unchanged for > 30min
- Git repo on wrong branch or in broken rebase state
- Pipeline stalled: backlog issues exist but no agent ran for > 20min
- Dev-agent blocked: last N polls all report "no ready issues"
- Dev/action sessions in PHASE:escalate for > 24h (escalation timeout)
(Note: PHASE:escalate files for closed issues are auto-cleaned by preflight;
this check covers escalations where the issue is still open)
### P3 — Factory degraded
- PRs stale: CI finished >20min ago AND no git push to the PR branch since CI completed
(Do NOT flag PRs that are actively being worked on — only truly inactive ones)
- Circular dependency deadlocks in backlog
- Stale dependencies (blocked by issues open > 30 days)
### P4 — Housekeeping
- Stale worktrees > 2h old with no active process
- Lock files for dead processes
- Stale claude processes (> 3h old)
List each finding with its priority level. If everything looks healthy,
note "All systems healthy" and proceed.
"""
needs = ["preflight"]
[[steps]]
id = "decide-actions"
title = "Fix what you can, escalate what you cannot"
description = """
For each finding from the health assessment, decide and execute an action.
### Auto-fixable (execute these directly)
**P0 Memory crisis:**
# Kill stale one-shot claude processes (>3h old)
pgrep -f "claude -p" --older 10800 2>/dev/null | xargs kill 2>/dev/null || true
# Drop filesystem caches
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null 2>&1 || true
**P1 Disk pressure:**
# Docker cleanup
sudo docker system prune -f >/dev/null 2>&1 || true
# Truncate logs > 10MB
for f in "$FACTORY_ROOT"/{dev,review,supervisor,gardener,planner,predictor}/*.log; do
[ -f "$f" ] && [ "$(du -k "$f" | cut -f1)" -gt 10240 ] && truncate -s 0 "$f"
done
**P2 Dead lock files:**
rm -f /path/to/stale.lock
**P2 Stale rebase:**
cd "$PROJECT_REPO_ROOT"
git rebase --abort 2>/dev/null
git checkout "$PRIMARY_BRANCH" 2>/dev/null
**P2 Wrong branch:**
cd "$PROJECT_REPO_ROOT"
git checkout "$PRIMARY_BRANCH" 2>/dev/null
**P4 Stale PHASE:escalate files (closed issues):**
Already handled by preflight.sh auto-cleanup. Check "Stale Phase Cleanup"
in the metrics for results. Log any cleanups in the journal.
**P4 Stale worktrees:**
git -C "$PROJECT_REPO_ROOT" worktree remove --force /tmp/stale-worktree 2>/dev/null
git -C "$PROJECT_REPO_ROOT" worktree prune 2>/dev/null
**P4 Stale claude processes:**
pgrep -f "claude -p" --older 10800 2>/dev/null | xargs kill 2>/dev/null || true
**P3 Stale PRs (CI done >20min, no push since):**
Do NOT read dev-poll.sh, push branches, attempt merges, or investigate pipeline code.
Instead, nudge the dev-agent via tmux injection if a session is alive:
# Find the dev session for this issue
SESSION=$(tmux list-sessions -F '#{session_name}' 2>/dev/null | grep "dev-.*-${ISSUE_NUM}" | head -1)
if [ -n "$SESSION" ]; then
# Inject a nudge into the dev-agent session
tmux send-keys -t "$SESSION" "# [supervisor] PR stale >20min — CI finished, please push or update" Enter
fi
If no active tmux session exists, note it in the journal for the next dev-poll cycle.
Do NOT escalate stale PRs to Matrix unless they remain stale for >3 consecutive runs.
### Escalation replies (from Matrix)
If there are escalation replies from a human, act on them:
- "ignore X" note in journal, do not alert on X this run
- "kill that agent" identify and kill the referenced session
- "what's stuck?" include detailed status in the Matrix report
- Other instructions follow them, use best judgment
### Cannot auto-fix → escalate
For P0-P2 issues that persist after auto-fix attempts, or issues requiring
human judgment, prepare an escalation message for the report step.
Read the relevant best-practices file before taking action:
cat "$FACTORY_ROOT/supervisor/best-practices/memory.md" # P0
cat "$FACTORY_ROOT/supervisor/best-practices/disk.md" # P1
cat "$FACTORY_ROOT/supervisor/best-practices/ci.md" # P2 CI
cat "$FACTORY_ROOT/supervisor/best-practices/dev-agent.md" # P2 agent
cat "$FACTORY_ROOT/supervisor/best-practices/git.md" # P2 git
Track what you fixed and what needs escalation for the report step.
"""
needs = ["health-assessment"]
[[steps]]
id = "report"
title = "Post health summary to Matrix"
description = """
Post a status summary to Matrix. Use the matrix_send function:
source "$FACTORY_ROOT/lib/env.sh"
matrix_send "supervisor" "<message>"
### When everything is healthy
Post a brief "all clear" only if the PREVIOUS run had alerts (check journal).
Do NOT post "all clear" every 20 minutes that would be noise.
### When there are findings
Post a summary grouped by priority:
matrix_send "supervisor" "Supervisor health check:
Fixed:
- <what was auto-fixed>
Alerts:
- [P2] <description>
- [P3] <description>
Status: RAM=<X>MB Disk=<Y>% Load=<Z>"
### When escalation is needed (P0-P2 unresolved)
Escalate with a clear call to action:
matrix_send "supervisor" "ESCALATE: <what's wrong and why you can't fix it>
Suggested action: <what the human should do>"
### Responding to escalation replies
If you acted on a human's reply, confirm what you did:
matrix_send "supervisor" "Acted on your reply: <summary of action taken>"
Keep messages concise. Do not post identical messages to what was posted
in the previous run (check journal for prior messages).
"""
needs = ["decide-actions"]
[[steps]]
id = "journal"
title = "Write health journal entry"
description = """
Append a timestamped entry to the supervisor journal.
File path:
$FACTORY_ROOT/supervisor/journal/$(date -u +%Y-%m-%d).md
If the file already exists (multiple runs per day), append a new section.
If it does not exist, create it.
Format:
## Supervisor run — HH:MM UTC
### Health status
- RAM: <X>MB available, Swap: <X>MB
- Disk: <X>%
- Load: <X>
- Docker: <N> containers
### Findings
- [P<N>] <finding> — <action taken or "escalated">
(or "No issues found — all systems healthy")
### Actions taken
- <what was fixed>
(or "No actions needed")
### Escalation replies processed
- <human said X, did Y>
(or "None")
Keep each entry concise — 15-25 lines max. This journal provides
run-to-run context so future supervisor runs can detect trends
(e.g., "disk has been >75% for 3 consecutive runs").
IMPORTANT: Do NOT commit or push the journal — it is a local working file.
The journal directory is committed to git periodically by other agents.
After writing the journal, write the phase signal:
echo 'PHASE:done' > "$PHASE_FILE"
"""
needs = ["report"]