disinto/supervisor/AGENTS.md
Claude f299bae77b
All checks were successful
ci/woodpecker/push/ci Pipeline was successful
ci/woodpecker/pr/ci Pipeline was successful
fix: bug: supervisor hardcodes ops repo expectation — fails silently on deployments without one (#544)
Add OPS repo presence detection in supervisor-run.sh with degraded mode support:
- Detect if OPS_REPO_ROOT is missing and log WARNING message
- Set OPS_REPO_DEGRADED=1 flag and configure fallback paths
- Bundle minimal knowledge files as fallback for degraded mode
- Update formula to use OPS_KNOWLEDGE_ROOT, OPS_JOURNAL_ROOT, OPS_VAULT_ROOT
- Support local vault destination and journal fallback when ops repo absent

Knowledge files bundled: disk.md, memory.md, ci.md, git.md, dev-agent.md,
review-agent.md, forge.md

The supervisor now runs with full functionality when ops repo is available,
or gracefully degrades to local paths when absent, making the failure mode
explicit rather than silent.
2026-04-10 08:16:03 +00:00

3.3 KiB

Supervisor Agent

Role: Health monitoring and auto-remediation, executed as a formula-driven Claude agent. Collects system and project metrics via a bash pre-flight script, then runs an interactive Claude session (sonnet) that assesses health, auto-fixes issues, and writes a daily journal. When blocked on external resources or human decisions, files vault items instead of escalating directly.

Trigger: supervisor-run.sh runs every 20 min via cron. Sources lib/guard.sh and calls check_active supervisor first — skips if $FACTORY_ROOT/state/.supervisor-active is absent. Then runs claude -p via agent-sdk.sh, injects formulas/run-supervisor.toml with pre-collected metrics as context, and cleans up on completion or timeout (20 min max session). No action issues — the supervisor runs directly from cron like the planner and predictor.

Key files:

  • supervisor/supervisor-run.sh — Cron wrapper + orchestrator: lock, memory guard, runs preflight.sh, sources disinto project config, runs claude -p via agent-sdk.sh, injects formula prompt with metrics, handles crash recovery
  • supervisor/preflight.sh — Data collection: system resources (RAM, disk, swap, load), Docker status, active sessions + phase files, lock files, agent log tails, CI pipeline status, open PRs, issue counts, stale worktrees, blocked issues. Also performs stale phase cleanup: scans /tmp/*-session-*.phase files for PHASE:escalate entries and auto-removes any whose linked issue is confirmed closed (24h grace period after closure to avoid races). Reports stale crashed worktrees (worktrees preserved after crash) — supervisor housekeeping removes them after 24h
  • formulas/run-supervisor.toml — Execution spec: five steps (preflight review, health-assessment, decide-actions, report, journal) with needs dependencies. Claude evaluates all metrics and takes actions in a single interactive session
  • $OPS_REPO_ROOT/knowledge/*.md — Domain-specific remediation guides (memory, disk, CI, git, dev-agent, review-agent, forge)

Alert priorities: P0 (memory crisis), P1 (disk), P2 (factory stopped/stalled), P3 (degraded PRs, circular deps, stale deps), P4 (housekeeping).

Environment variables consumed:

  • FORGE_TOKEN, FORGE_SUPERVISOR_TOKEN (falls back to FORGE_TOKEN), FORGE_REPO, FORGE_API, PROJECT_NAME, PROJECT_REPO_ROOT, OPS_REPO_ROOT
  • PRIMARY_BRANCH, CLAUDE_MODEL (set to sonnet by supervisor-run.sh)
  • WOODPECKER_TOKEN, WOODPECKER_SERVER, WOODPECKER_DB_PASSWORD, WOODPECKER_DB_USER, WOODPECKER_DB_HOST, WOODPECKER_DB_NAME — CI database queries

Degraded mode (Issue #544): When OPS_REPO_ROOT is not set or the directory doesn't exist, the supervisor runs in degraded mode:

  • Uses bundled knowledge files from $FACTORY_ROOT/knowledge/ instead of ops repo playbooks
  • Writes journal locally to $FACTORY_ROOT/state/supervisor-journal/ (not committed to git)
  • Files vault items locally to $PROJECT_REPO_ROOT/vault/pending/
  • Logs a WARNING message at startup indicating degraded mode

Lifecycle: supervisor-run.sh (cron */20) → lock + memory guard → run preflight.sh (collect metrics) → load formula + context → run claude -p via agent-sdk.sh → Claude assesses health, auto-fixes, writes journal → PHASE:done.