Add OPS repo presence detection in supervisor-run.sh with degraded mode support: - Detect if OPS_REPO_ROOT is missing and log WARNING message - Set OPS_REPO_DEGRADED=1 flag and configure fallback paths - Bundle minimal knowledge files as fallback for degraded mode - Update formula to use OPS_KNOWLEDGE_ROOT, OPS_JOURNAL_ROOT, OPS_VAULT_ROOT - Support local vault destination and journal fallback when ops repo absent Knowledge files bundled: disk.md, memory.md, ci.md, git.md, dev-agent.md, review-agent.md, forge.md The supervisor now runs with full functionality when ops repo is available, or gracefully degrades to local paths when absent, making the failure mode explicit rather than silent.
3.3 KiB
Supervisor Agent
Role: Health monitoring and auto-remediation, executed as a formula-driven Claude agent. Collects system and project metrics via a bash pre-flight script, then runs an interactive Claude session (sonnet) that assesses health, auto-fixes issues, and writes a daily journal. When blocked on external resources or human decisions, files vault items instead of escalating directly.
Trigger: supervisor-run.sh runs every 20 min via cron. Sources lib/guard.sh
and calls check_active supervisor first — skips if
$FACTORY_ROOT/state/.supervisor-active is absent. Then runs claude -p
via agent-sdk.sh, injects formulas/run-supervisor.toml with
pre-collected metrics as context, and cleans up on completion or timeout (20 min max session).
No action issues — the supervisor runs directly from cron like the planner and predictor.
Key files:
supervisor/supervisor-run.sh— Cron wrapper + orchestrator: lock, memory guard, runs preflight.sh, sources disinto project config, runs claude -p via agent-sdk.sh, injects formula prompt with metrics, handles crash recoverysupervisor/preflight.sh— Data collection: system resources (RAM, disk, swap, load), Docker status, active sessions + phase files, lock files, agent log tails, CI pipeline status, open PRs, issue counts, stale worktrees, blocked issues. Also performs stale phase cleanup: scans/tmp/*-session-*.phasefiles forPHASE:escalateentries and auto-removes any whose linked issue is confirmed closed (24h grace period after closure to avoid races). Reports stale crashed worktrees (worktrees preserved after crash) — supervisor housekeeping removes them after 24hformulas/run-supervisor.toml— Execution spec: five steps (preflight review, health-assessment, decide-actions, report, journal) withneedsdependencies. Claude evaluates all metrics and takes actions in a single interactive session$OPS_REPO_ROOT/knowledge/*.md— Domain-specific remediation guides (memory, disk, CI, git, dev-agent, review-agent, forge)
Alert priorities: P0 (memory crisis), P1 (disk), P2 (factory stopped/stalled), P3 (degraded PRs, circular deps, stale deps), P4 (housekeeping).
Environment variables consumed:
FORGE_TOKEN,FORGE_SUPERVISOR_TOKEN(falls back to FORGE_TOKEN),FORGE_REPO,FORGE_API,PROJECT_NAME,PROJECT_REPO_ROOT,OPS_REPO_ROOTPRIMARY_BRANCH,CLAUDE_MODEL(set to sonnet by supervisor-run.sh)WOODPECKER_TOKEN,WOODPECKER_SERVER,WOODPECKER_DB_PASSWORD,WOODPECKER_DB_USER,WOODPECKER_DB_HOST,WOODPECKER_DB_NAME— CI database queries
Degraded mode (Issue #544): When OPS_REPO_ROOT is not set or the directory doesn't exist, the supervisor runs in degraded mode:
- Uses bundled knowledge files from
$FACTORY_ROOT/knowledge/instead of ops repo playbooks - Writes journal locally to
$FACTORY_ROOT/state/supervisor-journal/(not committed to git) - Files vault items locally to
$PROJECT_REPO_ROOT/vault/pending/ - Logs a WARNING message at startup indicating degraded mode
Lifecycle: supervisor-run.sh (cron */20) → lock + memory guard → run
preflight.sh (collect metrics) → load formula + context → run claude -p via agent-sdk.sh
→ Claude assesses health, auto-fixes, writes journal → PHASE:done.