fix: feat: supervisor as formula-driven agent — cron + Matrix escalation (#245)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
7fe5ed0381
commit
d8244742f1
5 changed files with 585 additions and 12 deletions
57
AGENTS.md
57
AGENTS.md
|
|
@ -22,7 +22,10 @@ disinto/
|
|||
├── planner/ planner-run.sh — direct cron executor for run-planner formula
|
||||
│ planner/journal/ — daily raw logs from each planner run
|
||||
│ prediction-poll.sh, prediction-agent.sh — legacy predictor (superseded by predictor/)
|
||||
├── supervisor/ supervisor-poll.sh — health monitoring
|
||||
├── supervisor/ supervisor-run.sh — formula-driven health monitoring (cron wrapper)
|
||||
│ preflight.sh — pre-flight data collection for supervisor formula
|
||||
│ supervisor/journal/ — daily health logs from each run
|
||||
│ supervisor-poll.sh — legacy bash orchestrator (superseded)
|
||||
├── vault/ vault-poll.sh, vault-agent.sh, vault-fire.sh — action gating
|
||||
├── action/ action-poll.sh, action-agent.sh — operational task execution
|
||||
├── lib/ env.sh, agent-session.sh, ci-helpers.sh, ci-debug.sh, load-project.sh, parse-deps.sh, matrix_listener.sh
|
||||
|
|
@ -138,26 +141,56 @@ issue is filed against).
|
|||
|
||||
### Supervisor (`supervisor/`)
|
||||
|
||||
**Role**: Health monitoring and auto-remediation. Two-layer architecture:
|
||||
(1) factory infrastructure checks (RAM, disk, swap, docker, stale processes)
|
||||
that run once, and (2) per-project checks (CI, PRs, dev-agent health,
|
||||
circular deps, stale deps) that iterate over `projects/*.toml`.
|
||||
**Role**: Health monitoring and auto-remediation, executed as a formula-driven
|
||||
Claude agent. Collects system and project metrics via a bash pre-flight script,
|
||||
then runs an interactive Claude session (sonnet) that assesses health, auto-fixes
|
||||
issues, escalates via Matrix, and writes a daily journal.
|
||||
|
||||
**Trigger**: `supervisor-poll.sh` runs every 10 min via cron.
|
||||
**Trigger**: `supervisor-run.sh` runs every 20 min via cron. It creates a tmux
|
||||
session with `claude --model sonnet`, injects `formulas/run-supervisor.toml`
|
||||
with pre-collected metrics as context, monitors the phase file, and cleans up
|
||||
on completion or timeout (20 min max session). No action issues — the supervisor
|
||||
runs directly from cron like the planner and predictor.
|
||||
|
||||
**Key files**:
|
||||
- `supervisor/supervisor-poll.sh` — All checks + auto-fixes (kill stale processes, rotate logs, drop caches, docker prune, abort stale rebases) then invokes `claude -p` for unresolved alerts
|
||||
- `supervisor/update-prompt.sh` — Updates the supervisor prompt file
|
||||
- `supervisor/PROMPT.md` — System prompt for the supervisor's Claude invocation
|
||||
- `supervisor/supervisor-run.sh` — Cron wrapper + orchestrator: lock, memory guard,
|
||||
runs preflight.sh, sources disinto project config, creates tmux session, injects
|
||||
formula prompt with metrics, monitors phase file, handles crash recovery via
|
||||
`run_formula_and_monitor`
|
||||
- `supervisor/preflight.sh` — Data collection: system resources (RAM, disk, swap,
|
||||
load), Docker status, active tmux sessions + phase files, lock files, agent log
|
||||
tails, CI pipeline status, open PRs, issue counts, stale worktrees, pending
|
||||
escalations, Matrix escalation replies
|
||||
- `formulas/run-supervisor.toml` — Execution spec: five steps (preflight review,
|
||||
health-assessment, decide-actions, report, journal) with `needs` dependencies.
|
||||
Claude evaluates all metrics and takes actions in a single interactive session
|
||||
- `supervisor/journal/*.md` — Daily health logs from each supervisor run (local,
|
||||
committed periodically)
|
||||
- `supervisor/PROMPT.md` — Best-practices reference for remediation actions
|
||||
- `supervisor/best-practices/*.md` — Domain-specific remediation guides (memory,
|
||||
disk, CI, git, dev-agent, review-agent, codeberg)
|
||||
- `supervisor/supervisor-poll.sh` — Legacy bash orchestrator (superseded by
|
||||
supervisor-run.sh + formula)
|
||||
|
||||
**Alert priorities**: P0 (memory crisis), P1 (disk), P2 (factory stopped/stalled),
|
||||
P3 (degraded PRs, circular deps, stale deps), P4 (housekeeping).
|
||||
|
||||
**Matrix integration**: The supervisor has its own Matrix thread. Posts health
|
||||
summaries when there are changes, escalates P0-P2 issues, and processes replies
|
||||
from humans ("ignore disk warning", "kill that agent", "what's stuck?"). The
|
||||
Matrix listener routes thread replies to `/tmp/supervisor-escalation-reply`,
|
||||
which `supervisor-run.sh` consumes atomically on each run.
|
||||
|
||||
**Environment variables consumed**:
|
||||
- All from `lib/env.sh` + per-project TOML overrides
|
||||
- `CODEBERG_TOKEN`, `CODEBERG_REPO`, `CODEBERG_API`, `PROJECT_NAME`, `PROJECT_REPO_ROOT`
|
||||
- `PRIMARY_BRANCH`, `CLAUDE_MODEL` (set to sonnet by supervisor-run.sh)
|
||||
- `WOODPECKER_TOKEN`, `WOODPECKER_SERVER`, `WOODPECKER_DB_PASSWORD`, `WOODPECKER_DB_USER`, `WOODPECKER_DB_HOST`, `WOODPECKER_DB_NAME` — CI database queries
|
||||
- `CHECK_PRS`, `CHECK_DEV_AGENT`, `CHECK_PIPELINE_STALL` — Per-project monitoring toggles (from TOML `[monitoring]` section)
|
||||
- `CHECK_INFRA_RETRY` — Infra failure retry toggle (env var only, defaults to `true`; not configurable via project TOML)
|
||||
- `MATRIX_TOKEN`, `MATRIX_ROOM_ID`, `MATRIX_HOMESERVER` — Matrix notifications + human input
|
||||
|
||||
**Lifecycle**: supervisor-run.sh (cron */20) → lock + memory guard → run
|
||||
preflight.sh (collect metrics) → consume escalation replies → load formula +
|
||||
context → create tmux session → Claude assesses health, auto-fixes, posts
|
||||
Matrix summary, writes journal → `PHASE:done`.
|
||||
|
||||
### Planner (`planner/`)
|
||||
|
||||
|
|
|
|||
237
formulas/run-supervisor.toml
Normal file
237
formulas/run-supervisor.toml
Normal file
|
|
@ -0,0 +1,237 @@
|
|||
# formulas/run-supervisor.toml — Supervisor formula (health monitoring + remediation)
|
||||
#
|
||||
# Executed by supervisor/supervisor-run.sh via cron (every 20 minutes).
|
||||
# supervisor-run.sh creates a tmux session with Claude (sonnet) and injects
|
||||
# this formula with pre-collected metrics as context.
|
||||
#
|
||||
# Steps: preflight → health-assessment → decide-actions → report → journal
|
||||
#
|
||||
# Key differences from planner/gardener:
|
||||
# - Runs every 20min — lightweight health check
|
||||
# - Primarily READS state, rarely WRITES (no PRs, just Matrix + journal)
|
||||
# - Reactive to escalations — processes pending escalation events
|
||||
# - Conversation memory via Matrix thread and journal
|
||||
|
||||
name = "run-supervisor"
|
||||
description = "Factory health monitoring: assess metrics, fix issues, report via Matrix, write journal"
|
||||
version = 1
|
||||
model = "sonnet"
|
||||
|
||||
[context]
|
||||
files = ["AGENTS.md"]
|
||||
|
||||
[[steps]]
|
||||
id = "preflight"
|
||||
title = "Review pre-collected metrics"
|
||||
description = """
|
||||
The pre-flight metrics have already been collected by supervisor/preflight.sh
|
||||
and injected into your prompt above. Review them now.
|
||||
|
||||
1. Read the injected metrics data carefully (System Resources, Docker,
|
||||
Active Sessions, Phase Files, Lock Files, Agent Logs, CI Pipelines,
|
||||
Open PRs, Issue Status, Stale Worktrees, Pending Escalations,
|
||||
Escalation Replies).
|
||||
|
||||
2. If there are escalation replies from Matrix (human messages), note them —
|
||||
you will act on them in the decide-actions step.
|
||||
|
||||
3. Read the supervisor journal for recent history:
|
||||
JOURNAL_FILE="$FACTORY_ROOT/supervisor/journal/$(date -u +%Y-%m-%d).md"
|
||||
if [ -f "$JOURNAL_FILE" ]; then cat "$JOURNAL_FILE"; fi
|
||||
|
||||
4. Note any values that cross these thresholds:
|
||||
- RAM available < 500MB or swap > 3GB → P0 (memory crisis)
|
||||
- Disk > 80% → P1 (disk pressure)
|
||||
- Agent sessions dead, CI stuck/pending, git in bad state → P2 (factory stopped)
|
||||
- PRs stale, unreviewed, or with merge conflicts → P3 (factory degraded)
|
||||
- Stale worktrees, old lock files → P4 (housekeeping)
|
||||
"""
|
||||
|
||||
[[steps]]
|
||||
id = "health-assessment"
|
||||
title = "Evaluate health of each subsystem"
|
||||
description = """
|
||||
Categorize every finding from the metrics into priority levels.
|
||||
|
||||
### P0 — Memory crisis
|
||||
- RAM available < 500MB
|
||||
- Swap used > 3GB AND RAM available < 2000MB
|
||||
|
||||
### P1 — Disk pressure
|
||||
- Disk usage > 80%
|
||||
|
||||
### P2 — Factory stopped / stalled
|
||||
- CI pipelines stuck running > 20min or pending > 30min
|
||||
- Dev-agent lock file present but process dead
|
||||
- Dev-agent status unchanged for > 30min
|
||||
- Git repo on wrong branch or in broken rebase state
|
||||
- Pipeline stalled: backlog issues exist but no agent ran for > 20min
|
||||
- Dev-agent blocked: last N polls all report "no ready issues"
|
||||
- Dev sessions in PHASE:needs_human for > 24h
|
||||
|
||||
### P3 — Factory degraded
|
||||
- PRs with CI pass but merge conflict (needs rebase)
|
||||
- PRs with CI failure stale > 30min
|
||||
- PRs with CI pass but no review for > 60min
|
||||
- Circular dependency deadlocks in backlog
|
||||
- Stale dependencies (blocked by issues open > 30 days)
|
||||
|
||||
### P4 — Housekeeping
|
||||
- Stale worktrees > 2h old with no active process
|
||||
- Lock files for dead processes
|
||||
- Stale claude processes (> 3h old)
|
||||
|
||||
List each finding with its priority level. If everything looks healthy,
|
||||
note "All systems healthy" and proceed.
|
||||
"""
|
||||
needs = ["preflight"]
|
||||
|
||||
[[steps]]
|
||||
id = "decide-actions"
|
||||
title = "Fix what you can, escalate what you cannot"
|
||||
description = """
|
||||
For each finding from the health assessment, decide and execute an action.
|
||||
|
||||
### Auto-fixable (execute these directly)
|
||||
|
||||
**P0 Memory crisis:**
|
||||
# Kill stale one-shot claude processes (>3h old)
|
||||
pgrep -f "claude -p" --older 10800 2>/dev/null | xargs kill 2>/dev/null || true
|
||||
# Drop filesystem caches
|
||||
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null 2>&1 || true
|
||||
|
||||
**P1 Disk pressure:**
|
||||
# Docker cleanup
|
||||
sudo docker system prune -f >/dev/null 2>&1 || true
|
||||
# Truncate logs > 10MB
|
||||
for f in "$FACTORY_ROOT"/{dev,review,supervisor,gardener,planner,predictor}/*.log; do
|
||||
[ -f "$f" ] && [ "$(du -k "$f" | cut -f1)" -gt 10240 ] && truncate -s 0 "$f"
|
||||
done
|
||||
|
||||
**P2 Dead lock files:**
|
||||
rm -f /path/to/stale.lock
|
||||
|
||||
**P2 Stale rebase:**
|
||||
cd "$PROJECT_REPO_ROOT"
|
||||
git rebase --abort 2>/dev/null
|
||||
git checkout "$PRIMARY_BRANCH" 2>/dev/null
|
||||
|
||||
**P2 Wrong branch:**
|
||||
cd "$PROJECT_REPO_ROOT"
|
||||
git checkout "$PRIMARY_BRANCH" 2>/dev/null
|
||||
|
||||
**P4 Stale worktrees:**
|
||||
git -C "$PROJECT_REPO_ROOT" worktree remove --force /tmp/stale-worktree 2>/dev/null
|
||||
git -C "$PROJECT_REPO_ROOT" worktree prune 2>/dev/null
|
||||
|
||||
**P4 Stale claude processes:**
|
||||
pgrep -f "claude -p" --older 10800 2>/dev/null | xargs kill 2>/dev/null || true
|
||||
|
||||
### Escalation replies (from Matrix)
|
||||
|
||||
If there are escalation replies from a human, act on them:
|
||||
- "ignore X" → note in journal, do not alert on X this run
|
||||
- "kill that agent" → identify and kill the referenced session
|
||||
- "what's stuck?" → include detailed status in the Matrix report
|
||||
- Other instructions → follow them, use best judgment
|
||||
|
||||
### Cannot auto-fix → escalate
|
||||
|
||||
For P0-P2 issues that persist after auto-fix attempts, or issues requiring
|
||||
human judgment, prepare an escalation message for the report step.
|
||||
|
||||
Read the relevant best-practices file before taking action:
|
||||
cat "$FACTORY_ROOT/supervisor/best-practices/memory.md" # P0
|
||||
cat "$FACTORY_ROOT/supervisor/best-practices/disk.md" # P1
|
||||
cat "$FACTORY_ROOT/supervisor/best-practices/ci.md" # P2 CI
|
||||
cat "$FACTORY_ROOT/supervisor/best-practices/dev-agent.md" # P2 agent
|
||||
cat "$FACTORY_ROOT/supervisor/best-practices/git.md" # P2 git
|
||||
|
||||
Track what you fixed and what needs escalation for the report step.
|
||||
"""
|
||||
needs = ["health-assessment"]
|
||||
|
||||
[[steps]]
|
||||
id = "report"
|
||||
title = "Post health summary to Matrix"
|
||||
description = """
|
||||
Post a status summary to Matrix. Use the matrix_send function:
|
||||
source "$FACTORY_ROOT/lib/env.sh"
|
||||
matrix_send "supervisor" "<message>"
|
||||
|
||||
### When everything is healthy
|
||||
Post a brief "all clear" only if the PREVIOUS run had alerts (check journal).
|
||||
Do NOT post "all clear" every 20 minutes — that would be noise.
|
||||
|
||||
### When there are findings
|
||||
Post a summary grouped by priority:
|
||||
matrix_send "supervisor" "Supervisor health check:
|
||||
|
||||
Fixed:
|
||||
- <what was auto-fixed>
|
||||
|
||||
Alerts:
|
||||
- [P2] <description>
|
||||
- [P3] <description>
|
||||
|
||||
Status: RAM=<X>MB Disk=<Y>% Load=<Z>"
|
||||
|
||||
### When escalation is needed (P0-P2 unresolved)
|
||||
Escalate with a clear call to action:
|
||||
matrix_send "supervisor" "ESCALATE: <what's wrong and why you can't fix it>
|
||||
|
||||
Suggested action: <what the human should do>"
|
||||
|
||||
### Responding to escalation replies
|
||||
If you acted on a human's reply, confirm what you did:
|
||||
matrix_send "supervisor" "Acted on your reply: <summary of action taken>"
|
||||
|
||||
Keep messages concise. Do not post identical messages to what was posted
|
||||
in the previous run (check journal for prior messages).
|
||||
"""
|
||||
needs = ["decide-actions"]
|
||||
|
||||
[[steps]]
|
||||
id = "journal"
|
||||
title = "Write health journal entry"
|
||||
description = """
|
||||
Append a timestamped entry to the supervisor journal.
|
||||
|
||||
File path:
|
||||
$FACTORY_ROOT/supervisor/journal/$(date -u +%Y-%m-%d).md
|
||||
|
||||
If the file already exists (multiple runs per day), append a new section.
|
||||
If it does not exist, create it.
|
||||
|
||||
Format:
|
||||
## Supervisor run — HH:MM UTC
|
||||
|
||||
### Health status
|
||||
- RAM: <X>MB available, Swap: <X>MB
|
||||
- Disk: <X>%
|
||||
- Load: <X>
|
||||
- Docker: <N> containers
|
||||
|
||||
### Findings
|
||||
- [P<N>] <finding> — <action taken or "escalated">
|
||||
(or "No issues found — all systems healthy")
|
||||
|
||||
### Actions taken
|
||||
- <what was fixed>
|
||||
(or "No actions needed")
|
||||
|
||||
### Escalation replies processed
|
||||
- <human said X, did Y>
|
||||
(or "None")
|
||||
|
||||
Keep each entry concise — 15-25 lines max. This journal provides
|
||||
run-to-run context so future supervisor runs can detect trends
|
||||
(e.g., "disk has been >75% for 3 consecutive runs").
|
||||
|
||||
IMPORTANT: Do NOT commit or push the journal — it is a local working file.
|
||||
The journal directory is committed to git periodically by other agents.
|
||||
|
||||
After writing the journal, write the phase signal:
|
||||
echo 'PHASE:done' > '$PHASE_FILE'
|
||||
"""
|
||||
needs = ["report"]
|
||||
0
supervisor/journal/.gitkeep
Normal file
0
supervisor/journal/.gitkeep
Normal file
188
supervisor/preflight.sh
Executable file
188
supervisor/preflight.sh
Executable file
|
|
@ -0,0 +1,188 @@
|
|||
#!/usr/bin/env bash
|
||||
# =============================================================================
|
||||
# preflight.sh — Collect system and project metrics for the supervisor formula
|
||||
#
|
||||
# Outputs structured text to stdout. Called by supervisor-run.sh before
|
||||
# launching the Claude session. The output is injected into the prompt.
|
||||
#
|
||||
# Usage:
|
||||
# bash supervisor/preflight.sh [projects/disinto.toml]
|
||||
# =============================================================================
|
||||
set -euo pipefail
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
|
||||
FACTORY_ROOT="$(dirname "$SCRIPT_DIR")"
|
||||
|
||||
export PROJECT_TOML="${1:-$FACTORY_ROOT/projects/disinto.toml}"
|
||||
# shellcheck source=../lib/env.sh
|
||||
source "$FACTORY_ROOT/lib/env.sh"
|
||||
# shellcheck source=../lib/ci-helpers.sh
|
||||
source "$FACTORY_ROOT/lib/ci-helpers.sh"
|
||||
|
||||
# ── System Resources ─────────────────────────────────────────────────────
|
||||
|
||||
echo "## System Resources"
|
||||
|
||||
_avail_mb=$(free -m | awk '/Mem:/{print $7}')
|
||||
_total_mb=$(free -m | awk '/Mem:/{print $2}')
|
||||
_swap_used=$(free -m | awk '/Swap:/{print $3}')
|
||||
_disk_pct=$(df -h / | awk 'NR==2{print $5}' | tr -d '%')
|
||||
_disk_used=$(df -h / | awk 'NR==2{print $3}')
|
||||
_disk_total=$(df -h / | awk 'NR==2{print $2}')
|
||||
_load=$(cat /proc/loadavg 2>/dev/null || echo "unknown")
|
||||
|
||||
echo "RAM: ${_avail_mb}MB available / ${_total_mb}MB total, Swap: ${_swap_used}MB used"
|
||||
echo "Disk: ${_disk_pct}% used (${_disk_used}/${_disk_total} on /)"
|
||||
echo "Load: ${_load}"
|
||||
echo ""
|
||||
|
||||
# ── Docker ────────────────────────────────────────────────────────────────
|
||||
|
||||
echo "## Docker"
|
||||
if command -v docker &>/dev/null; then
|
||||
docker ps --format 'table {{.Names}}\t{{.Status}}' 2>/dev/null || echo "Docker query failed"
|
||||
else
|
||||
echo "Docker not available"
|
||||
fi
|
||||
echo ""
|
||||
|
||||
# ── Active Sessions + Phase Files ─────────────────────────────────────────
|
||||
|
||||
echo "## Active Sessions"
|
||||
if tmux list-sessions 2>/dev/null; then
|
||||
:
|
||||
else
|
||||
echo "No tmux sessions"
|
||||
fi
|
||||
echo ""
|
||||
|
||||
echo "## Phase Files"
|
||||
_found_phase=false
|
||||
for _pf in /tmp/*-session-*.phase; do
|
||||
[ -f "$_pf" ] || continue
|
||||
_found_phase=true
|
||||
_phase_content=$(head -1 "$_pf" 2>/dev/null || echo "unreadable")
|
||||
_phase_age_min=$(( ($(date +%s) - $(stat -c %Y "$_pf" 2>/dev/null || echo 0)) / 60 ))
|
||||
echo " $(basename "$_pf"): ${_phase_content} (${_phase_age_min}min ago)"
|
||||
done
|
||||
[ "$_found_phase" = false ] && echo " None"
|
||||
echo ""
|
||||
|
||||
# ── Lock Files ────────────────────────────────────────────────────────────
|
||||
|
||||
echo "## Lock Files"
|
||||
_found_lock=false
|
||||
for _lf in /tmp/*-poll.lock /tmp/*-run.lock /tmp/dev-agent-*.lock; do
|
||||
[ -f "$_lf" ] || continue
|
||||
_found_lock=true
|
||||
_pid=$(cat "$_lf" 2>/dev/null || true)
|
||||
_age_min=$(( ($(date +%s) - $(stat -c %Y "$_lf" 2>/dev/null || echo 0)) / 60 ))
|
||||
_alive="dead"
|
||||
[ -n "${_pid:-}" ] && kill -0 "$_pid" 2>/dev/null && _alive="alive"
|
||||
echo " $(basename "$_lf"): PID=${_pid:-?} ${_alive} age=${_age_min}min"
|
||||
done
|
||||
[ "$_found_lock" = false ] && echo " None"
|
||||
echo ""
|
||||
|
||||
# ── Agent Logs (last 15 lines each) ──────────────────────────────────────
|
||||
|
||||
echo "## Recent Agent Logs"
|
||||
for _log in supervisor/supervisor.log dev/dev-agent.log review/review.log \
|
||||
gardener/gardener.log planner/planner.log predictor/predictor.log \
|
||||
action/action.log; do
|
||||
_logpath="${FACTORY_ROOT}/${_log}"
|
||||
if [ -f "$_logpath" ]; then
|
||||
_log_age_min=$(( ($(date +%s) - $(stat -c %Y "$_logpath" 2>/dev/null || echo 0)) / 60 ))
|
||||
echo "### ${_log} (last modified ${_log_age_min}min ago)"
|
||||
tail -15 "$_logpath" 2>/dev/null || echo "(read failed)"
|
||||
echo ""
|
||||
fi
|
||||
done
|
||||
|
||||
# ── CI Pipelines ──────────────────────────────────────────────────────────
|
||||
|
||||
echo "## CI Pipelines (${PROJECT_NAME})"
|
||||
|
||||
_recent_ci=$(wpdb -A -c "
|
||||
SELECT number, status, branch,
|
||||
ROUND(EXTRACT(EPOCH FROM (to_timestamp(finished) - to_timestamp(started)))/60)::int as dur_min
|
||||
FROM pipelines
|
||||
WHERE repo_id = ${WOODPECKER_REPO_ID}
|
||||
AND finished > 0
|
||||
AND to_timestamp(finished) > now() - interval '24 hours'
|
||||
ORDER BY number DESC LIMIT 10;" 2>/dev/null || echo "CI database query failed")
|
||||
echo "$_recent_ci"
|
||||
|
||||
_stuck=$(wpdb -c "
|
||||
SELECT count(*) FROM pipelines
|
||||
WHERE repo_id=${WOODPECKER_REPO_ID}
|
||||
AND status='running'
|
||||
AND EXTRACT(EPOCH FROM now() - to_timestamp(started)) > 1200;" 2>/dev/null | xargs || echo "?")
|
||||
|
||||
_pending=$(wpdb -c "
|
||||
SELECT count(*) FROM pipelines
|
||||
WHERE repo_id=${WOODPECKER_REPO_ID}
|
||||
AND status='pending'
|
||||
AND EXTRACT(EPOCH FROM now() - to_timestamp(created)) > 1800;" 2>/dev/null | xargs || echo "?")
|
||||
|
||||
echo "Stuck (>20min): ${_stuck}"
|
||||
echo "Pending (>30min): ${_pending}"
|
||||
echo ""
|
||||
|
||||
# ── Open PRs ──────────────────────────────────────────────────────────────
|
||||
|
||||
echo "## Open PRs (${PROJECT_NAME})"
|
||||
_open_prs=$(codeberg_api GET "/pulls?state=open&limit=10" 2>/dev/null || echo "[]")
|
||||
echo "$_open_prs" | jq -r '.[] | "#\(.number) [\(.head.ref)] \(.title) — updated \(.updated_at)"' 2>/dev/null || echo "No open PRs or query failed"
|
||||
echo ""
|
||||
|
||||
# ── Backlog + In-Progress ─────────────────────────────────────────────────
|
||||
|
||||
echo "## Issue Status (${PROJECT_NAME})"
|
||||
_backlog_count=$(codeberg_api GET "/issues?state=open&labels=backlog&type=issues&limit=1" 2>/dev/null | jq 'length' 2>/dev/null || echo "?")
|
||||
_in_progress_count=$(codeberg_api GET "/issues?state=open&labels=in-progress&type=issues&limit=1" 2>/dev/null | jq 'length' 2>/dev/null || echo "?")
|
||||
_blocked_count=$(codeberg_api GET "/issues?state=open&labels=blocked&type=issues&limit=1" 2>/dev/null | jq 'length' 2>/dev/null || echo "?")
|
||||
echo "Backlog: ${_backlog_count}, In-progress: ${_in_progress_count}, Blocked: ${_blocked_count}"
|
||||
echo ""
|
||||
|
||||
# ── Stale Worktrees ───────────────────────────────────────────────────────
|
||||
|
||||
echo "## Stale Worktrees"
|
||||
_found_wt=false
|
||||
for _wt in /tmp/*-worktree-* /tmp/*-review-*; do
|
||||
[ -d "$_wt" ] || continue
|
||||
_found_wt=true
|
||||
_wt_age_min=$(( ($(date +%s) - $(stat -c %Y "$_wt" 2>/dev/null || echo 0)) / 60 ))
|
||||
echo " $(basename "$_wt"): ${_wt_age_min}min old"
|
||||
done
|
||||
[ "$_found_wt" = false ] && echo " None"
|
||||
echo ""
|
||||
|
||||
# ── Pending Escalations ──────────────────────────────────────────────────
|
||||
|
||||
echo "## Pending Escalations"
|
||||
_found_esc=false
|
||||
for _esc_file in "${FACTORY_ROOT}/supervisor/escalations-"*.jsonl; do
|
||||
[ -f "$_esc_file" ] || continue
|
||||
[[ "$_esc_file" == *.done.jsonl ]] && continue
|
||||
_esc_count=$(wc -l < "$_esc_file" 2>/dev/null || echo 0)
|
||||
[ "${_esc_count:-0}" -gt 0 ] || continue
|
||||
_found_esc=true
|
||||
echo "### $(basename "$_esc_file") (${_esc_count} entries)"
|
||||
cat "$_esc_file"
|
||||
echo ""
|
||||
done
|
||||
[ "$_found_esc" = false ] && echo " None"
|
||||
echo ""
|
||||
|
||||
# ── Escalation Replies from Matrix ────────────────────────────────────────
|
||||
|
||||
echo "## Escalation Replies (from Matrix)"
|
||||
if [ -s /tmp/supervisor-escalation-reply ]; then
|
||||
cat /tmp/supervisor-escalation-reply
|
||||
echo ""
|
||||
echo "(This reply will be consumed after this run)"
|
||||
else
|
||||
echo " None"
|
||||
fi
|
||||
echo ""
|
||||
115
supervisor/supervisor-run.sh
Executable file
115
supervisor/supervisor-run.sh
Executable file
|
|
@ -0,0 +1,115 @@
|
|||
#!/usr/bin/env bash
|
||||
# =============================================================================
|
||||
# supervisor-run.sh — Cron wrapper: supervisor execution via Claude + formula
|
||||
#
|
||||
# Runs every 20 minutes (or on-demand). Guards against concurrent runs and
|
||||
# low memory. Collects metrics via preflight.sh, then creates a tmux session
|
||||
# with Claude (sonnet) reading formulas/run-supervisor.toml.
|
||||
#
|
||||
# Replaces supervisor-poll.sh (bash orchestrator + claude -p one-shot) with
|
||||
# formula-driven interactive Claude session matching the planner/predictor
|
||||
# pattern.
|
||||
#
|
||||
# Usage:
|
||||
# supervisor-run.sh [projects/disinto.toml] # project config (default: disinto)
|
||||
#
|
||||
# Cron: */20 * * * * cd /path/to/dark-factory && bash supervisor/supervisor-run.sh
|
||||
# =============================================================================
|
||||
set -euo pipefail
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
|
||||
FACTORY_ROOT="$(dirname "$SCRIPT_DIR")"
|
||||
|
||||
# Accept project config from argument; default to disinto
|
||||
export PROJECT_TOML="${1:-$FACTORY_ROOT/projects/disinto.toml}"
|
||||
# shellcheck source=../lib/env.sh
|
||||
source "$FACTORY_ROOT/lib/env.sh"
|
||||
# shellcheck source=../lib/agent-session.sh
|
||||
source "$FACTORY_ROOT/lib/agent-session.sh"
|
||||
# shellcheck source=../lib/formula-session.sh
|
||||
source "$FACTORY_ROOT/lib/formula-session.sh"
|
||||
|
||||
LOG_FILE="$SCRIPT_DIR/supervisor.log"
|
||||
# shellcheck disable=SC2034 # consumed by run_formula_and_monitor
|
||||
SESSION_NAME="supervisor-${PROJECT_NAME}"
|
||||
PHASE_FILE="/tmp/supervisor-session-${PROJECT_NAME}.phase"
|
||||
|
||||
# shellcheck disable=SC2034 # read by monitor_phase_loop in lib/agent-session.sh
|
||||
PHASE_POLL_INTERVAL=15
|
||||
|
||||
SCRATCH_FILE="/tmp/supervisor-${PROJECT_NAME}-scratch.md"
|
||||
|
||||
log() { echo "[$(date -u +%Y-%m-%dT%H:%M:%S)Z] $*" >> "$LOG_FILE"; }
|
||||
|
||||
# ── Guards ────────────────────────────────────────────────────────────────
|
||||
acquire_cron_lock "/tmp/supervisor-run.lock"
|
||||
check_memory 2000
|
||||
|
||||
log "--- Supervisor run start ---"
|
||||
|
||||
# ── Collect pre-flight metrics ────────────────────────────────────────────
|
||||
log "Running preflight.sh"
|
||||
PREFLIGHT_OUTPUT=""
|
||||
if PREFLIGHT_OUTPUT=$(bash "$SCRIPT_DIR/preflight.sh" "$PROJECT_TOML" 2>&1); then
|
||||
log "Preflight collected ($(echo "$PREFLIGHT_OUTPUT" | wc -l) lines)"
|
||||
else
|
||||
log "WARNING: preflight.sh failed, continuing with partial data"
|
||||
fi
|
||||
|
||||
# ── Consume escalation replies ────────────────────────────────────────────
|
||||
# Move the file atomically so matrix_listener can write a new one
|
||||
ESCALATION_REPLY=""
|
||||
if [ -s /tmp/supervisor-escalation-reply ]; then
|
||||
_reply_tmp="/tmp/supervisor-escalation-reply.consumed.$$"
|
||||
if mv /tmp/supervisor-escalation-reply "$_reply_tmp" 2>/dev/null; then
|
||||
ESCALATION_REPLY=$(cat "$_reply_tmp")
|
||||
rm -f "$_reply_tmp"
|
||||
log "Consumed escalation reply: $(echo "$ESCALATION_REPLY" | head -1)"
|
||||
fi
|
||||
fi
|
||||
|
||||
# ── Load formula + context ───────────────────────────────────────────────
|
||||
load_formula "$FACTORY_ROOT/formulas/run-supervisor.toml"
|
||||
build_context_block AGENTS.md
|
||||
|
||||
# ── Read scratch file (compaction survival) ───────────────────────────────
|
||||
SCRATCH_CONTEXT=$(read_scratch_context "$SCRATCH_FILE")
|
||||
SCRATCH_INSTRUCTION=$(build_scratch_instruction "$SCRATCH_FILE")
|
||||
|
||||
# ── Build prompt ─────────────────────────────────────────────────────────
|
||||
build_prompt_footer
|
||||
|
||||
# shellcheck disable=SC2034 # consumed by run_formula_and_monitor
|
||||
PROMPT="You are the supervisor agent for ${CODEBERG_REPO}. Work through the formula below. You MUST write PHASE:done to '${PHASE_FILE}' when finished — the orchestrator will time you out if you return to the prompt without signalling.
|
||||
|
||||
You have full shell access and --dangerously-skip-permissions.
|
||||
Fix what you can. Escalate what you cannot. Do NOT ask permission — act first, report after.
|
||||
|
||||
## Pre-flight metrics (collected $(date -u +%H:%M) UTC)
|
||||
${PREFLIGHT_OUTPUT}
|
||||
${ESCALATION_REPLY:+
|
||||
## Escalation Reply (from Matrix — human message)
|
||||
${ESCALATION_REPLY}
|
||||
|
||||
Act on this reply in the decide-actions step.
|
||||
}
|
||||
## Project context
|
||||
${CONTEXT_BLOCK}
|
||||
${SCRATCH_CONTEXT:+${SCRATCH_CONTEXT}
|
||||
}
|
||||
## Formula
|
||||
${FORMULA_CONTENT}
|
||||
|
||||
${SCRATCH_INSTRUCTION}
|
||||
|
||||
${PROMPT_FOOTER}"
|
||||
|
||||
# ── Run session ──────────────────────────────────────────────────────────
|
||||
export CLAUDE_MODEL="sonnet"
|
||||
run_formula_and_monitor "supervisor" 1200
|
||||
|
||||
# ── Cleanup scratch file on normal exit ──────────────────────────────────
|
||||
# FINAL_PHASE already set by run_formula_and_monitor
|
||||
if [ "${FINAL_PHASE:-}" = "PHASE:done" ]; then
|
||||
rm -f "$SCRATCH_FILE"
|
||||
fi
|
||||
Loading…
Add table
Add a link
Reference in a new issue