fix: Remove escalation — planner routes through vault instead (#721)
Remove ESCALATED signal and escalation handling from planner, supervisor, and gardener. When blocked on external resources or human decisions, these agents now file vault procurement items (vault/pending/*.md) instead of escalating directly to the human.

Changes:
- Planner formula: ESCALATED signal replaced with HUMAN_BLOCKED; files vault items and marks prerequisites as blocked-on-vault
- Supervisor formula/prompt: escalation sections replaced with vault item filing; preflight now reports pending vault items instead of escalation replies
- Gardener formula: ESCALATE action replaced with VAULT action; files vault/pending/*.md for human decisions
- Groom-backlog formula: same ESCALATE→VAULT replacement
- Gardener shell: PHASE:escalate replaced with PHASE:failed for merge blocks and CI exhaustion; escalation reply consumption removed
- Supervisor shell: escalation reply consumption removed from both supervisor-run.sh and legacy supervisor-poll.sh
- Prerequisite tree: #466 updated from "escalated" to "blocked-on-vault"

The vault is the factory's only interface to the human for resources and approvals. Dev/action agents retain PHASE:escalate for operational session issues (CI timeouts, merge blocks), which are a different mechanism.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
parent 850a8d743f
commit f2064ba67c
11 changed files with 117 additions and 113 deletions
@@ -9,7 +9,7 @@
 # Key differences from planner/gardener:
 # - Runs every 20min — lightweight health check
 # - Primarily READS state, rarely WRITES (no PRs, just Matrix + journal)
-# - Reactive to escalations — processes pending escalation events
+# - Checks vault state for pending procurement items
 # - Conversation memory via Matrix thread and journal
 
 name = "run-supervisor"
@@ -29,14 +29,14 @@ and injected into your prompt above. Review them now.
 
 1. Read the injected metrics data carefully (System Resources, Docker,
 Active Sessions, Phase Files, Stale Phase Cleanup, Lock Files, Agent Logs,
-CI Pipelines, Open PRs, Issue Status, Stale Worktrees, Pending Escalations,
-Escalation Replies).
+CI Pipelines, Open PRs, Issue Status, Stale Worktrees).
 Note: preflight.sh auto-removes PHASE:escalate files for closed issues
 (24h grace period). Check the "Stale Phase Cleanup" section for any
 files cleaned or in grace period this run.
 
-2. If there are escalation replies from Matrix (human messages), note them —
-you will act on them in the decide-actions step.
+2. Check vault state: read vault/pending/*.md for any procurement items
+the planner has filed. Note items relevant to the health assessment
+(e.g. a blocked resource that explains why the pipeline is stalled).
 
 3. Read the supervisor journal for recent history:
 JOURNAL_FILE="$FACTORY_ROOT/supervisor/journal/$(date -u +%Y-%m-%d).md"
 
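Step 2 of the preflight now reads vault/pending/*.md. Below is a minimal POSIX sh sketch of that read. It assumes each item starts with a `# <title>` line, as in the supervisor's own vault template. `list_pending_vault_items` is a hypothetical name, not a script in the factory.

```sh
# Hypothetical helper: print "<relative path><TAB><title>" for each pending
# vault item. Assumes items live in <repo_root>/vault/pending/*.md and start
# with a "# <title>" line, like the supervisor's own vault template.
list_pending_vault_items() {
  local repo_root="$1"
  local f title
  for f in "$repo_root"/vault/pending/*.md; do
    # If nothing matches, the glob stays literal; skip it.
    [ -e "$f" ] || continue
    title=$(head -n 1 "$f")
    # Drop the leading "# " of the markdown heading, if present.
    title=${title#"# "}
    printf '%s\t%s\n' "vault/pending/${f##*/}" "$title"
  done
}
```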
@@ -70,9 +70,9 @@ Categorize every finding from the metrics into priority levels.
 - Git repo on wrong branch or in broken rebase state
 - Pipeline stalled: backlog issues exist but no agent ran for > 20min
 - Dev-agent blocked: last N polls all report "no ready issues"
-- Dev/action sessions in PHASE:escalate for > 24h (escalation timeout)
+- Dev/action sessions in PHASE:escalate for > 24h (session timeout)
 (Note: PHASE:escalate files for closed issues are auto-cleaned by preflight;
-this check covers escalations where the issue is still open)
+this check covers sessions where the issue is still open)
 
 ### P3 — Factory degraded
 - PRs stale: CI finished >20min ago AND no git push to the PR branch since CI completed
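The P2 check for sessions that have sat in PHASE:escalate for more than 24h could be approximated with `find -mmin`. This sketch assumes phase files sit in one directory, end in `.phase`, and begin with their `PHASE:` line. That layout and the `find_stale_escalations` name are hypothetical. The sketch does not look up issue state, so unlike the formula's check it cannot tell open issues from closed ones that preflight has not cleaned yet.

```sh
# Hypothetical helper: list *.phase files in PHASE_DIR that are older than
# MAX_AGE_MIN minutes (default 1440 = 24h) and whose content starts a line
# with PHASE:escalate. The directory layout is an assumption.
find_stale_escalations() {
  local phase_dir="$1" max_age_min="${2:-1440}"
  # -mmin +N: last modified more than N minutes ago.
  # grep -l prints each matching file name. It exits 1 when nothing matches,
  # which is not an error here, so the status is normalized to 0.
  find "$phase_dir" -maxdepth 1 -type f -name '*.phase' -mmin "+$max_age_min" \
    -exec grep -l '^PHASE:escalate' {} + || true
}
```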
@@ -92,7 +92,7 @@ needs = ["preflight"]
 
 [[steps]]
 id = "decide-actions"
-title = "Fix what you can, escalate what you cannot"
+title = "Fix what you can, file vault items for what you cannot"
 description = """
 For each finding from the health assessment, decide and execute an action.
 
@@ -145,20 +145,21 @@ For each finding from the health assessment, decide and execute an action.
 tmux send-keys -t "$SESSION" "# [supervisor] PR stale >20min — CI finished, please push or update" Enter
 fi
 If no active tmux session exists, note it in the journal for the next dev-poll cycle.
-Do NOT escalate stale PRs to Matrix unless they remain stale for >3 consecutive runs.
+Do NOT file vault items for stale PRs unless they remain stale for >3 consecutive runs.
 
-### Escalation replies (from Matrix)
-
-If there are escalation replies from a human, act on them:
-- "ignore X" → note in journal, do not alert on X this run
-- "kill that agent" → identify and kill the referenced session
-- "what's stuck?" → include detailed status in the Matrix report
-- Other instructions → follow them, use best judgment
-
-### Cannot auto-fix → escalate
+### Cannot auto-fix → file vault item
 
 For P0-P2 issues that persist after auto-fix attempts, or issues requiring
-human judgment, prepare an escalation message for the report step.
+human judgment, file a vault procurement item:
+Write $PROJECT_REPO_ROOT/vault/pending/supervisor-<issue-slug>.md:
+# <What is needed>
+## What
+<description of the problem and why the supervisor cannot fix it>
+## Why
+<impact on factory health — reference the priority level>
+## Unblocks
+- Factory health: <what this resolves>
+The vault-poll will notify the human and track the request.
 
 Read the relevant best-practices file before taking action:
 cat "$FACTORY_ROOT/supervisor/best-practices/memory.md" # P0
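Filing a vault item with the What / Why / Unblocks template could look like the sketch below. `file_vault_item` is a hypothetical helper. Sanitizing the slug and leaving an existing file untouched are choices of this sketch, not requirements stated in the formula.

```sh
# Hypothetical helper: write vault/pending/supervisor-<slug>.md using the
# What / Why / Unblocks template from the decide-actions step, then print its path.
# Args: repo_root slug title what why unblocks
file_vault_item() {
  local repo_root="$1" title="$3" what="$4" why="$5" unblocks="$6"
  local slug dir path
  # Replace anything outside [A-Za-z0-9-] so the slug is a safe file name.
  slug=$(printf '%s\n' "$2" | sed 's/[^A-Za-z0-9-]/-/g')
  dir="$repo_root/vault/pending"
  path="$dir/supervisor-$slug.md"
  mkdir -p "$dir"
  # Keep an existing item as-is so a repeated run does not rewrite the request.
  if [ ! -e "$path" ]; then
    {
      printf '# %s\n\n' "$title"
      printf '## What\n%s\n\n' "$what"
      printf '## Why\n%s\n\n' "$why"
      printf '## Unblocks\n- Factory health: %s\n' "$unblocks"
    } > "$path"
  fi
  printf '%s\n' "$path"
}
```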
@@ -167,7 +168,7 @@ Read the relevant best-practices file before taking action:
 cat "$FACTORY_ROOT/supervisor/best-practices/dev-agent.md" # P2 agent
 cat "$FACTORY_ROOT/supervisor/best-practices/git.md" # P2 git
 
-Track what you fixed and what needs escalation for the report step.
+Track what you fixed and what vault items you filed for the report step.
 """
 needs = ["health-assessment"]
 
@@ -196,15 +197,14 @@ Post a summary grouped by priority:
 
 Status: RAM=<X>MB Disk=<Y>% Load=<Z>"
 
-### When escalation is needed (P0-P2 unresolved)
-Escalate with a clear call to action:
-matrix_send "supervisor" "ESCALATE: <what's wrong and why you can't fix it>
+### When vault items were filed (P0-P2 unresolved)
+Note the vault items in the status summary:
+matrix_send "supervisor" "Supervisor health check:
 
-Suggested action: <what the human should do>"
+Filed vault items:
+- vault/pending/<id>.md — <summary>
 
-### Responding to escalation replies
-If you acted on a human's reply, confirm what you did:
-matrix_send "supervisor" "Acted on your reply: <summary of action taken>"
+Status: RAM=<X>MB Disk=<Y>% Load=<Z>"
 
 Keep messages concise. Do not post identical messages to what was posted
 in the previous run (check journal for prior messages).
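The report step says not to repeat the previous run's message and to check the journal for it. As a swapped-in alternative to reading the journal, this sketch keeps a `cksum` digest of the last posted message in a state file. The state file and all three function names are hypothetical.

```sh
# Swapped-in technique (not the formula's journal check): remember a cksum
# digest of the last posted message in a hypothetical state file.
message_digest() {
  printf '%s' "$1" | cksum
}

# Exit 0 when MESSAGE ($2) matches the digest recorded in STATE_FILE ($1).
is_repeat_message() {
  [ -f "$1" ] && [ "$(cat "$1")" = "$(message_digest "$2")" ]
}

# Record the digest of MESSAGE ($2) in STATE_FILE ($1) after it was posted.
record_posted_message() {
  message_digest "$2" > "$1"
}
```

The digest is recorded only after a successful post, so a failed send is retried on the next run. For example: `is_repeat_message "$STATE_FILE" "$msg" || { matrix_send "supervisor" "$msg" && record_posted_message "$STATE_FILE" "$msg"; }`. Here `matrix_send` is the helper the formula already uses, and `$STATE_FILE` is hypothetical.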
@@ -233,15 +233,15 @@ Format:
 - Docker: <N> containers
 
 ### Findings
-- [P<N>] <finding> — <action taken or "escalated">
+- [P<N>] <finding> — <action taken or "filed vault item">
 (or "No issues found — all systems healthy")
 
 ### Actions taken
 - <what was fixed>
 (or "No actions needed")
 
-### Escalation replies processed
-- <human said X, did Y>
+### Vault items filed
+- vault/pending/<id>.md — <reason>
 (or "None")
 
 Keep each entry concise — 15-25 lines max. This journal provides
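A helper along these lines could produce the journal format above. It writes to the same dated path as `JOURNAL_FILE` in the preflight step. The `append_journal_entry` name, the per-run heading, and the shortened placeholder defaults are assumptions of this sketch.

```sh
# Hypothetical helper: append one run's entry to today's journal file
# (same path as JOURNAL_FILE in the preflight step) and print the path.
# Args: factory_root findings actions vault_items (empty means "none")
append_journal_entry() {
  local factory_root="$1" findings="$2" actions="$3" vault_items="$4"
  local dir file
  dir="$factory_root/supervisor/journal"
  file="$dir/$(date -u +%Y-%m-%d).md"
  mkdir -p "$dir"
  {
    # Per-run heading; the exact heading format is an assumption.
    printf '\n## Run %s UTC\n\n' "$(date -u +%H:%M)"
    printf '### Findings\n%s\n\n' "${findings:-No issues found}"
    printf '### Actions taken\n%s\n\n' "${actions:-No actions needed}"
    printf '### Vault items filed\n%s\n' "${vault_items:-None}"
  } >> "$file"
  printf '%s\n' "$file"
}
```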
|