disinto/formulas/run-supervisor.toml
Claude f0c3c773ff
All checks were successful
ci/woodpecker/push/ci Pipeline was successful
ci/woodpecker/pr/ci Pipeline was successful
ci/woodpecker/pr/smoke-init Pipeline was successful
fix: tech-debt: sweep cron-isms from code comments, helpers, lib, and public site copy (#548)
- Rename acquire_cron_lock → acquire_run_lock in lib/formula-session.sh
  and all five *-run.sh call sites
- Update all *-run.sh file headers: "Cron wrapper" → "Polling-loop wrapper"
- Rewrite docs/updating-factory.md: replace crontab check with pgrep,
  replace "Crontab empty after restart" section with polling-loop equivalent
- Update docs/EVAL-MCP-SERVER.md to reflect polling-loop reality
- Update lib/guard.sh, lib/AGENTS.md, lib/ci-setup.sh comments
- Update formulas/*.toml comments (cron → polling loop)
- Update dev/dev-poll.sh usage comment
- Update tests/smoke-init.sh to handle compose vs bare-metal scheduling
- Update .woodpecker/agent-smoke.sh comments
- Update site HTML: architecture.html, quickstart.html, index.html
- Clarify _install_cron_impl is bare-metal only (compose uses polling loop)
- Keep site/collect-engagement.sh and site/collect-metrics.sh cron refs
  (genuinely cron-driven on the website host, separate from factory loop)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 08:54:11 +00:00

282 lines
10 KiB
TOML

# formulas/run-supervisor.toml — Supervisor formula (health monitoring + remediation)
#
# Executed by supervisor/supervisor-run.sh via polling loop (every 20 minutes).
# supervisor-run.sh runs claude -p via agent-sdk.sh and injects
# this formula with pre-collected metrics as context.
#
# Steps: preflight → health-assessment → decide-actions → report → journal
#
# Key differences from planner/gardener:
# - Runs every 20min — lightweight health check
# - Primarily READS state, rarely WRITES (no PRs, just journal)
# - Checks vault state for pending procurement items
# - Conversation memory via journal
name = "run-supervisor"
description = "Factory health monitoring: assess metrics, fix issues, write journal"
version = 1
model = "sonnet"
[context]
files = ["AGENTS.md"]
[[steps]]
id = "preflight"
title = "Review pre-collected metrics"
description = """
The pre-flight metrics have already been collected by supervisor/preflight.sh
and injected into your prompt above. Review them now.
1. Read the injected metrics data carefully (System Resources, Docker,
Active Sessions, Phase Files, Stale Phase Cleanup, Lock Files, Agent Logs,
CI Pipelines, Open PRs, Issue Status, Stale Worktrees).
Note: preflight.sh auto-removes PHASE:escalate files for closed issues
(24h grace period). Check the "Stale Phase Cleanup" section for any
files cleaned or in grace period this run.
2. Check vault state: read ${OPS_VAULT_ROOT:-$OPS_REPO_ROOT/vault/pending}/*.md for any procurement items
the planner has filed. Note items relevant to the health assessment
(e.g. a blocked resource that explains why the pipeline is stalled).
Note: In degraded mode, vault items are stored locally.
3. Read the supervisor journal for recent history:
JOURNAL_FILE="${OPS_JOURNAL_ROOT:-$OPS_REPO_ROOT/journal/supervisor}/$(date -u +%Y-%m-%d).md"
if [ -f "$JOURNAL_FILE" ]; then cat "$JOURNAL_FILE"; fi
Note: In degraded mode, the journal is stored locally and not committed to git.
4. Note any values that cross these thresholds:
- RAM available < 500MB or swap > 3GB → P0 (memory crisis)
- Disk > 80% → P1 (disk pressure)
- Agent sessions dead, CI stuck/pending, git in bad state → P2 (factory stopped)
- PRs stale >20min (CI done, no push since) → P3 (factory degraded)
- Stale worktrees, old lock files → P4 (housekeeping)
"""
[[steps]]
id = "health-assessment"
title = "Evaluate health of each subsystem"
description = """
Categorize every finding from the metrics into priority levels.
### P0 — Memory crisis
- RAM available < 500MB
- Swap used > 3GB AND RAM available < 2000MB
### P1 — Disk pressure
- Disk usage > 80%
### P2 — Factory stopped / stalled
- CI pipelines stuck running > 20min or pending > 30min
- Dev-agent lock file present but process dead
- Dev-agent status unchanged for > 30min
- Git repo on wrong branch or in broken rebase state
- Pipeline stalled: backlog issues exist but no agent ran for > 20min
- Dev-agent blocked: last N polls all report "no ready issues"
- Dev/action sessions in PHASE:escalate for > 24h (session timeout)
(Note: PHASE:escalate files for closed issues are auto-cleaned by preflight;
this check covers sessions where the issue is still open)
### P3 — Factory degraded
- PRs stale: CI finished >20min ago AND no git push to the PR branch since CI completed
(Do NOT flag PRs that are actively being worked on — only truly inactive ones)
- Circular dependency deadlocks in backlog
- Stale dependencies (blocked by issues open > 30 days)
### P4 — Housekeeping
- Stale worktrees > 2h old with no active process
- Lock files for dead processes
- Stale claude processes (> 3h old)
List each finding with its priority level. If everything looks healthy,
note "All systems healthy" and proceed.
"""
needs = ["preflight"]
[[steps]]
id = "decide-actions"
title = "Fix what you can, file vault items for what you cannot"
description = """
For each finding from the health assessment, decide and execute an action.
### Auto-fixable (execute these directly)
**P0 Memory crisis:**
# Kill stale one-shot claude processes (>3h old)
pgrep -f "claude -p" --older 10800 2>/dev/null | xargs kill 2>/dev/null || true
# Drop filesystem caches
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null 2>&1 || true
**P1 Disk pressure:**
# First pass: dangling only (cheap, safe)
sudo docker system prune -f >/dev/null 2>&1 || true
# If still > 80%, escalate to all unused images (more aggressive but necessary)
_pct=$(df -h / | awk 'NR==2{print $5}' | tr -d '%')
if [ "${_pct:-0}" -gt 80 ]; then
sudo docker system prune -a -f >/dev/null 2>&1 || true
fi
# Truncate logs > 10MB
for f in "$FACTORY_ROOT"/{dev,review,supervisor,gardener,planner,predictor}/*.log; do
[ -f "$f" ] && [ "$(du -k "$f" | cut -f1)" -gt 10240 ] && truncate -s 0 "$f"
done
**P2 Dead lock files:**
rm -f /path/to/stale.lock
**P2 Stale rebase:**
cd "$PROJECT_REPO_ROOT"
git rebase --abort 2>/dev/null
git checkout "$PRIMARY_BRANCH" 2>/dev/null
**P2 Wrong branch:**
cd "$PROJECT_REPO_ROOT"
git checkout "$PRIMARY_BRANCH" 2>/dev/null
**P4 Stale PHASE:escalate files (closed issues):**
Already handled by preflight.sh auto-cleanup. Check "Stale Phase Cleanup"
in the metrics for results. Log any cleanups in the journal.
**P4 Stale worktrees:**
git -C "$PROJECT_REPO_ROOT" worktree remove --force /tmp/stale-worktree 2>/dev/null
git -C "$PROJECT_REPO_ROOT" worktree prune 2>/dev/null
**P4 Stale claude processes:**
pgrep -f "claude -p" --older 10800 2>/dev/null | xargs kill 2>/dev/null || true
**P3 Stale PRs (CI done >20min, no push since):**
Do NOT read dev-poll.sh, push branches, attempt merges, or investigate pipeline code.
Instead, file a vault item for the dev-agent to pick up:
Write ${OPS_VAULT_ROOT:-$OPS_REPO_ROOT/vault/pending}/stale-pr-${ISSUE_NUM}.md:
# Stale PR: ${PR_TITLE}
## What
CI finished >20min ago but no git push has been made to the PR branch.
## Why
P3 — Factory degraded: PRs should be pushed within 20min of CI completion.
## Unblocks
- Factory health: dev-agent will push the branch and continue the workflow
Do NOT file vault items for stale PRs unless they remain stale for >3 consecutive runs.
### Cannot auto-fix → file vault item
For P0-P2 issues that persist after auto-fix attempts, or issues requiring
human judgment, file a vault procurement item:
Write ${OPS_VAULT_ROOT:-$OPS_REPO_ROOT/vault/pending}/supervisor-<issue-slug>.md:
# <What is needed>
## What
<description of the problem and why the supervisor cannot fix it>
## Why
<impact on factory health — reference the priority level>
## Unblocks
- Factory health: <what this resolves>
Vault PR filed on ops repo — human approves via PR review.
Note: In degraded mode (no ops repo), vault items are written locally to ${OPS_VAULT_ROOT:-local path}.
### Reading best-practices files
Read the relevant best-practices file before taking action. In degraded mode,
use the bundled knowledge files from ${OPS_KNOWLEDGE_ROOT:-$OPS_REPO_ROOT/knowledge}:
cat "${OPS_KNOWLEDGE_ROOT:-$OPS_REPO_ROOT/knowledge}/memory.md" # P0
cat "${OPS_KNOWLEDGE_ROOT:-$OPS_REPO_ROOT/knowledge}/disk.md" # P1
cat "${OPS_KNOWLEDGE_ROOT:-$OPS_REPO_ROOT/knowledge}/ci.md" # P2 CI
cat "${OPS_KNOWLEDGE_ROOT:-$OPS_REPO_ROOT/knowledge}/dev-agent.md" # P2 agent
cat "${OPS_KNOWLEDGE_ROOT:-$OPS_REPO_ROOT/knowledge}/git.md" # P2 git
cat "${OPS_KNOWLEDGE_ROOT:-$OPS_REPO_ROOT/knowledge}/review-agent.md" # P2 review
cat "${OPS_KNOWLEDGE_ROOT:-$OPS_REPO_ROOT/knowledge}/forge.md" # P2 forge
Note: If OPS_REPO_ROOT is not available (degraded mode), the bundled knowledge
files in ${OPS_KNOWLEDGE_ROOT:-<unset>} provide fallback guidance.
Track what you fixed and what vault items you filed for the report step.
"""
needs = ["health-assessment"]
[[steps]]
id = "report"
title = "Log health summary"
description = """
Log a status summary to the journal.
### When everything is healthy
Log a brief "all clear" only if the PREVIOUS run had alerts (check journal).
Do NOT log "all clear" every 20 minutes — that would be noise.
### When there are findings
Log a summary grouped by priority with:
Fixed:
- <what was auto-fixed>
Alerts:
- [P2] <description>
- [P3] <description>
Status: RAM=<X>MB Disk=<Y>% Load=<Z>
### When vault items were filed (P0-P2 unresolved)
Note the vault items in the status summary.
Keep messages concise. Do not log identical messages to what was logged
in the previous run (check journal for prior messages).
"""
needs = ["decide-actions"]
[[steps]]
id = "journal"
title = "Write health journal entry"
description = """
Append a timestamped entry to the supervisor journal.
File path:
${OPS_JOURNAL_ROOT:-$OPS_REPO_ROOT/journal/supervisor}/$(date -u +%Y-%m-%d).md
If the file already exists (multiple runs per day), append a new section.
If it does not exist, create it.
Format:
## Supervisor run — HH:MM UTC
### Health status
- RAM: <X>MB available, Swap: <X>MB
- Disk: <X>%
- Load: <X>
- Docker: <N> containers
### Findings
- [P<N>] <finding> — <action taken or "filed vault item">
(or "No issues found all systems healthy")
### Actions taken
- <what was fixed>
(or "No actions needed")
### Vault items filed
- vault/pending/<id>.md — <reason>
(or "None")
Keep each entry concise — 15-25 lines max. This journal provides
run-to-run context so future supervisor runs can detect trends
(e.g., "disk has been >75% for 3 consecutive runs").
IMPORTANT: Do NOT commit or push the journal — it is a local working file.
The journal directory is committed to git periodically by other agents.
Note: In degraded mode (no ops repo), the journal is written locally to
${OPS_JOURNAL_ROOT:-<unset>} and is NOT automatically committed to any repo.
## Learning
If you discover something new during this run:
- In full mode (ops repo available): append to the relevant knowledge file:
echo "### Lesson title
Description of what you learned." >> "${OPS_REPO_ROOT}/knowledge/<file>.md"
- In degraded mode: write to the local knowledge directory for reference:
echo "### Lesson title
Description of what you learned." >> "${OPS_KNOWLEDGE_ROOT:-<unset>}/<file>.md"
Knowledge files: memory.md, disk.md, ci.md, forge.md, dev-agent.md,
review-agent.md, git.md.
After writing the journal, the agent session completes automatically.
"""
needs = ["report"]