refactor: rename factory/ → supervisor/, factory-poll → supervisor-poll

The supervisor agent was confusingly named "factory" (same as the
project). Rename directory, script, log, lock, status, and escalation
files. Update all references across scripts and docs.

FACTORY_ROOT env var unchanged (refers to project root, not agent).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
johba 2026-03-15 18:06:25 +01:00
parent 8d73c2f8f9
commit 77cb4c4643
15 changed files with 68 additions and 68 deletions

View file

@ -55,7 +55,7 @@ CLAUDE_TIMEOUT=7200 # seconds per Claude invocation
### Required: CI pipeline ### Required: CI pipeline
The repo needs at least one Woodpecker pipeline. Dark-factory monitors CI status to decide when a PR is ready for review and when it can merge. The repo needs at least one Woodpecker pipeline. Disinto monitors CI status to decide when a PR is ready for review and when it can merge.
### Required: `CLAUDE.md` ### Required: `CLAUDE.md`
@ -155,7 +155,7 @@ Add (adjust paths):
FACTORY_ROOT=/home/you/disinto FACTORY_ROOT=/home/you/disinto
# Supervisor — health checks, auto-healing (every 10 min) # Supervisor — health checks, auto-healing (every 10 min)
0,10,20,30,40,50 * * * * $FACTORY_ROOT/factory/factory-poll.sh 0,10,20,30,40,50 * * * * $FACTORY_ROOT/supervisor/supervisor-poll.sh
# Review agent — find unreviewed PRs (every 10 min, offset +3) # Review agent — find unreviewed PRs (every 10 min, offset +3)
3,13,23,33,43,53 * * * * $FACTORY_ROOT/review/review-poll.sh 3,13,23,33,43,53 * * * * $FACTORY_ROOT/review/review-poll.sh
@ -176,7 +176,7 @@ The 3-minute offsets prevent agents from competing for resources.
```bash ```bash
# Should complete with "all clear" (no problems to fix) # Should complete with "all clear" (no problems to fix)
bash factory/factory-poll.sh bash supervisor/supervisor-poll.sh
# Should list backlog issues (or "no backlog issues") # Should list backlog issues (or "no backlog issues")
bash dev/dev-poll.sh bash dev/dev-poll.sh
@ -188,7 +188,7 @@ bash review/review-poll.sh
Check logs after a few cycles: Check logs after a few cycles:
```bash ```bash
tail -30 factory/factory.log tail -30 supervisor/supervisor.log
tail -30 dev/dev-agent.log tail -30 dev/dev-agent.log
tail -30 review/review.log tail -30 review/review.log
``` ```
@ -203,7 +203,7 @@ If you want real-time notifications and human-in-the-loop escalation:
sudo cp lib/matrix_listener.service /etc/systemd/system/ sudo cp lib/matrix_listener.service /etc/systemd/system/
sudo systemctl enable --now matrix_listener sudo systemctl enable --now matrix_listener
``` ```
3. The factory and gardener will post status updates and escalation threads to the configured room. Reply in-thread to answer escalations. 3. The supervisor and gardener will post status updates and escalation threads to the configured room. Reply in-thread to answer escalations.
## Lifecycle ## Lifecycle
@ -219,7 +219,7 @@ You write issues (with backlog label)
→ merge, close issue, clean up → merge, close issue, clean up
Meanwhile: Meanwhile:
factory-poll monitors health, kills stale processes, manages resources supervisor-poll monitors health, kills stale processes, manages resources
gardener grooms backlog: closes duplicates, promotes tech-debt, escalates ambiguity gardener grooms backlog: closes duplicates, promotes tech-debt, escalates ambiguity
planner rebuilds AGENTS.md from git history, gap-analyses against VISION.md planner rebuilds AGENTS.md from git history, gap-analyses against VISION.md
``` ```
@ -233,4 +233,4 @@ Meanwhile:
| CI stuck | `bash lib/ci-debug.sh` — check Woodpecker. Rate-limited? (exit 128 = wait 15 min) | | CI stuck | `bash lib/ci-debug.sh` — check Woodpecker. Rate-limited? (exit 128 = wait 15 min) |
| Claude not found | `which claude` — must be in PATH. Check `lib/env.sh` adds `~/.local/bin`. | | Claude not found | `which claude` — must be in PATH. Check `lib/env.sh` adds `~/.local/bin`. |
| Merge fails | Branch protection misconfigured? Review bot needs write access to the repo. | | Merge fails | Branch protection misconfigured? Review bot needs write access to the repo. |
| Memory issues | Factory auto-heals at <500 MB free. Check `factory/factory.log` for P0 alerts. | | Memory issues | Supervisor auto-heals at <500 MB free. Check `supervisor/supervisor.log` for P0 alerts. |

View file

@ -9,7 +9,7 @@ Point it at a Codeberg repo with a Woodpecker CI pipeline and it will pick up is
## Architecture ## Architecture
``` ```
cron (*/10) ──→ factory-poll.sh ← supervisor (bash checks, zero tokens) cron (*/10) ──→ supervisor-poll.sh ← supervisor (bash checks, zero tokens)
├── all clear? → exit 0 ├── all clear? → exit 0
└── problem? → claude -p (diagnose, fix, or escalate) └── problem? → claude -p (diagnose, fix, or escalate)
@ -33,9 +33,9 @@ all agents ──→ matrix_send() ← status updates, escalations, merge no
**Required:** **Required:**
- [Claude CLI](https://docs.anthropic.com/en/docs/claude-cli) — `claude` in PATH, authenticated - [Claude CLI](https://docs.anthropic.com/en/docs/claude-cli) — `claude` in PATH, authenticated
- [Codeberg](https://codeberg.org/) account with an API token — the factory reads issues, opens PRs, posts comments, and merges via the Codeberg API - [Codeberg](https://codeberg.org/) account with an API token — disinto reads issues, opens PRs, posts comments, and merges via the Codeberg API
- A second Codeberg account for the review bot — reviews posted under a separate identity so the dev-agent doesn't review its own PRs (`REVIEW_BOT_TOKEN`) - A second Codeberg account for the review bot — reviews posted under a separate identity so the dev-agent doesn't review its own PRs (`REVIEW_BOT_TOKEN`)
- [Woodpecker CI](https://woodpecker-ci.org/) — local instance connected to your Codeberg repo; the factory monitors pipelines, retries failures, and queries the Woodpecker Postgres DB directly - [Woodpecker CI](https://woodpecker-ci.org/) — local instance connected to your Codeberg repo; disinto monitors pipelines, retries failures, and queries the Woodpecker Postgres DB directly
- PostgreSQL client (`psql`) — for Woodpecker DB queries (pipeline status, build counts) - PostgreSQL client (`psql`) — for Woodpecker DB queries (pipeline status, build counts)
- `jq`, `curl`, `git` - `jq`, `curl`, `git`
@ -84,13 +84,13 @@ CLAUDE_TIMEOUT=7200 # max seconds per Claude invocation (default: 2h)
# 3. Install cron (staggered to avoid overlap) # 3. Install cron (staggered to avoid overlap)
crontab -e crontab -e
# Add: # Add:
# 0,10,20,30,40,50 * * * * /path/to/disinto/factory/factory-poll.sh # 0,10,20,30,40,50 * * * * /path/to/disinto/supervisor/supervisor-poll.sh
# 3,13,23,33,43,53 * * * * /path/to/disinto/review/review-poll.sh # 3,13,23,33,43,53 * * * * /path/to/disinto/review/review-poll.sh
# 6,16,26,36,46,56 * * * * /path/to/disinto/dev/dev-poll.sh # 6,16,26,36,46,56 * * * * /path/to/disinto/dev/dev-poll.sh
# 15 8 * * * /path/to/disinto/gardener/gardener-poll.sh # 15 8 * * * /path/to/disinto/gardener/gardener-poll.sh
# 4. Verify # 4. Verify
bash factory/factory-poll.sh # should log "all clear" bash supervisor/supervisor-poll.sh # should log "all clear"
``` ```
## Directory Structure ## Directory Structure
@ -113,8 +113,8 @@ disinto/
├── gardener/ ├── gardener/
│ ├── gardener-poll.sh # Cron entry: backlog grooming │ ├── gardener-poll.sh # Cron entry: backlog grooming
│ └── best-practices.md # Gardener knowledge base │ └── best-practices.md # Gardener knowledge base
└── factory/ └── supervisor/
├── factory-poll.sh # Supervisor: health checks + claude -p ├── supervisor-poll.sh # Supervisor: health checks + claude -p
├── PROMPT.md # Supervisor's system prompt ├── PROMPT.md # Supervisor's system prompt
├── update-prompt.sh # Self-learning: append to best-practices ├── update-prompt.sh # Self-learning: append to best-practices
└── best-practices/ # Progressive disclosure knowledge base └── best-practices/ # Progressive disclosure knowledge base
@ -131,7 +131,7 @@ disinto/
| Agent | Trigger | Job | | Agent | Trigger | Job |
|-------|---------|-----| |-------|---------|-----|
| **Factory** (supervisor) | Every 10 min | Health checks (RAM, disk, CI, git). Calls Claude only when something is broken. Self-improving via `best-practices/`. | | **Supervisor** | Every 10 min | Health checks (RAM, disk, CI, git). Calls Claude only when something is broken. Self-improving via `best-practices/`. |
| **Dev** | Every 10 min | Picks up `backlog`-labeled issues, creates a branch, implements, opens a PR, monitors CI, responds to review, merges. | | **Dev** | Every 10 min | Picks up `backlog`-labeled issues, creates a branch, implements, opens a PR, monitors CI, responds to review, merges. |
| **Review** | Every 10 min | Finds PRs without review, runs Claude-powered code review, approves or requests changes. | | **Review** | Every 10 min | Finds PRs without review, runs Claude-powered code review, approves or requests changes. |
| **Gardener** | Daily | Grooms the issue backlog: detects duplicates, promotes `tech-debt` to `backlog`, closes stale issues, escalates ambiguous items. | | **Gardener** | Daily | Grooms the issue backlog: detects duplicates, promotes `tech-debt` to `backlog`, closes stale issues, escalates ambiguous items. |

View file

@ -1106,9 +1106,9 @@ while [ "$REVIEW_ROUND" -lt "$MAX_REVIEW_ROUNDS" ]; do
CI_FIX_COUNT=$(( ${CI_FIX_COUNT:-0} + 1 )) CI_FIX_COUNT=$(( ${CI_FIX_COUNT:-0} + 1 ))
if [ "$CI_FIX_COUNT" -gt 2 ]; then if [ "$CI_FIX_COUNT" -gt 2 ]; then
log "CI failure not recoverable after ${CI_FIX_COUNT} fix attempts" log "CI failure not recoverable after ${CI_FIX_COUNT} fix attempts"
# Escalate to supervisor — write marker for factory-poll.sh to pick up # Escalate to supervisor — write marker for supervisor-poll.sh to pick up
echo "{\"issue\":${ISSUE},\"pr\":${PR_NUMBER},\"reason\":\"ci_exhausted\",\"step\":\"${FAILED_STEP:-unknown}\",\"attempts\":${CI_FIX_COUNT},\"ts\":\"$(date -u +%Y-%m-%dT%H:%M:%SZ)\"}" \ echo "{\"issue\":${ISSUE},\"pr\":${PR_NUMBER},\"reason\":\"ci_exhausted\",\"step\":\"${FAILED_STEP:-unknown}\",\"attempts\":${CI_FIX_COUNT},\"ts\":\"$(date -u +%Y-%m-%dT%H:%M:%SZ)\"}" \
>> "${FACTORY_ROOT}/factory/escalations.jsonl" >> "${FACTORY_ROOT}/supervisor/escalations.jsonl"
log "escalated to supervisor via escalations.jsonl" log "escalated to supervisor via escalations.jsonl"
break break
fi fi

View file

@ -1,5 +1,5 @@
#!/usr/bin/env bash #!/usr/bin/env bash
# dev-poll.sh — Pull-based factory: find the next ready issue and start dev-agent # dev-poll.sh — Pull-based scheduler: find the next ready issue and start dev-agent
# #
# Pull system: issues labeled "backlog" are candidates. An issue is READY when # Pull system: issues labeled "backlog" are candidates. An issue is READY when
# ALL its dependency issues are closed (and their PRs merged). # ALL its dependency issues are closed (and their PRs merged).
@ -104,7 +104,7 @@ dep_is_merged() {
return 1 return 1
fi fi
# Issue closed = dep satisfied. The factory only closes issues after # Issue closed = dep satisfied. The scheduler only closes issues after
# merging, so closed state is trustworthy. No need to hunt for the # merging, so closed state is trustworthy. No need to hunt for the
# specific PR — that was over-engineering that caused false negatives. # specific PR — that was over-engineering that caused false negatives.
return 0 return 0

View file

@ -1,11 +1,11 @@
#!/usr/bin/env bash #!/usr/bin/env bash
# matrix_listener.sh — Long-poll Matrix sync daemon # matrix_listener.sh — Long-poll Matrix sync daemon
# #
# Listens for replies in the factory Matrix room and dispatches them # Listens for replies in the Matrix coordination room and dispatches them
# to the appropriate agent via well-known files. # to the appropriate agent via well-known files.
# #
# Dispatch: # Dispatch:
# Thread reply to [supervisor] message → /tmp/factory-escalation-reply # Thread reply to [supervisor] message → /tmp/supervisor-escalation-reply
# Thread reply to [gardener] message → /tmp/gardener-escalation-reply # Thread reply to [gardener] message → /tmp/gardener-escalation-reply
# #
# Run as systemd service (see matrix_listener.service) or manually: # Run as systemd service (see matrix_listener.service) or manually:
@ -18,7 +18,7 @@ source "$(dirname "$0")/../lib/env.sh"
SINCE_FILE="/tmp/matrix-listener-since" SINCE_FILE="/tmp/matrix-listener-since"
THREAD_MAP="${MATRIX_THREAD_MAP:-/tmp/matrix-thread-map}" THREAD_MAP="${MATRIX_THREAD_MAP:-/tmp/matrix-thread-map}"
LOGFILE="${FACTORY_ROOT}/factory/matrix-listener.log" LOGFILE="${FACTORY_ROOT}/supervisor/matrix-listener.log"
SYNC_TIMEOUT=30000 # 30s long-poll SYNC_TIMEOUT=30000 # 30s long-poll
BACKOFF=5 BACKOFF=5
MAX_BACKOFF=60 MAX_BACKOFF=60
@ -133,7 +133,7 @@ while true; do
case "$AGENT" in case "$AGENT" in
supervisor) supervisor)
printf '%s\t%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$SENDER" "$BODY" >> /tmp/factory-escalation-reply printf '%s\t%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$SENDER" "$BODY" >> /tmp/supervisor-escalation-reply
# Acknowledge # Acknowledge
matrix_send "supervisor" "✓ received, will act on next poll" "$THREAD_ROOT" >/dev/null 2>&1 || true matrix_send "supervisor" "✓ received, will act on next poll" "$THREAD_ROOT" >/dev/null 2>&1 || true
;; ;;

View file

@ -1,7 +1,7 @@
# Factory Supervisor # Supervisor Agent
You are the factory supervisor for `$CODEBERG_REPO`. You were called because You are the supervisor agent for `$CODEBERG_REPO`. You were called because
`factory-poll.sh` detected an issue it couldn't auto-fix. `supervisor-poll.sh` detected an issue it couldn't auto-fix.
## Priority Order ## Priority Order
@ -16,13 +16,13 @@ You are the factory supervisor for `$CODEBERG_REPO`. You were called because
Fix the issue yourself. You have full shell access and `--dangerously-skip-permissions`. Fix the issue yourself. You have full shell access and `--dangerously-skip-permissions`.
Before acting, read the relevant best-practices file: Before acting, read the relevant best-practices file:
- Memory issues → `cat ${FACTORY_ROOT}/factory/best-practices/memory.md` - Memory issues → `cat ${FACTORY_ROOT}/supervisor/best-practices/memory.md`
- Disk issues → `cat ${FACTORY_ROOT}/factory/best-practices/disk.md` - Disk issues → `cat ${FACTORY_ROOT}/supervisor/best-practices/disk.md`
- CI issues → `cat ${FACTORY_ROOT}/factory/best-practices/ci.md` - CI issues → `cat ${FACTORY_ROOT}/supervisor/best-practices/ci.md`
- Codeberg / rate limits → `cat ${FACTORY_ROOT}/factory/best-practices/codeberg.md` - Codeberg / rate limits → `cat ${FACTORY_ROOT}/supervisor/best-practices/codeberg.md`
- Dev-agent issues → `cat ${FACTORY_ROOT}/factory/best-practices/dev-agent.md` - Dev-agent issues → `cat ${FACTORY_ROOT}/supervisor/best-practices/dev-agent.md`
- Review-agent issues → `cat ${FACTORY_ROOT}/factory/best-practices/review-agent.md` - Review-agent issues → `cat ${FACTORY_ROOT}/supervisor/best-practices/review-agent.md`
- Git issues → `cat ${FACTORY_ROOT}/factory/best-practices/git.md` - Git issues → `cat ${FACTORY_ROOT}/supervisor/best-practices/git.md`
## Credentials & API Access ## Credentials & API Access
@ -66,6 +66,6 @@ ESCALATE: <what's wrong>
If you discover something new, append it to the relevant best-practices file: If you discover something new, append it to the relevant best-practices file:
```bash ```bash
bash ${FACTORY_ROOT}/factory/update-prompt.sh "best-practices/<file>.md" "### Lesson title bash ${FACTORY_ROOT}/supervisor/update-prompt.sh "best-practices/<file>.md" "### Lesson title
Description of what you learned." Description of what you learned."
``` ```

View file

@ -22,7 +22,7 @@ cd <worktree> && git commit --allow-empty -m "ci: retrigger" --no-verify && git
``` ```
### Prevention ### Prevention
- The factory runs 3 agents staggered by 3 minutes. During heavy development, many PRs trigger CI simultaneously. - The system runs 3 agents staggered by 3 minutes. During heavy development, many PRs trigger CI simultaneously.
- One pipeline at a time is ideal on this VPS (resource + rate limit reasons). - One pipeline at a time is ideal on this VPS (resource + rate limit reasons).
- If >3 pipelines are pending/running, do NOT create more work. - If >3 pipelines are pending/running, do NOT create more work.

View file

@ -44,12 +44,12 @@ DO NOT try to find the specific PR that closed an issue. This is over-engineerin
- Codeberg shares issue/PR numbering — no guaranteed relationship - Codeberg shares issue/PR numbering — no guaranteed relationship
- PRs don't always mention the issue number in title/body - PRs don't always mention the issue number in title/body
- Searching last N closed PRs misses older merges - Searching last N closed PRs misses older merges
- The factory itself closes issues after merging, so closed = merged - The dev-agent closes issues after merging, so closed = merged
The only check needed: `issue.state == "closed"`. The only check needed: `issue.state == "closed"`.
### False Positive: Status Unchanged Alert ### False Positive: Status Unchanged Alert
The factory-poll alert 'status unchanged for Nmin' is a false positive for complex implementation tasks. The status is set to 'claude assessing + implementing' at the START of the `timeout 7200 claude -p ...` call and only updates after Claude finishes. Normal complex tasks (multi-file Solidity changes + forge test) take 45-90 minutes. To distinguish a false positive from a real stuck agent: check that the claude PID is alive (`ps -p <PID>`), consuming CPU (>0%), and has active threads (`pstree -p <PID>`). If the process is alive and using CPU, do NOT restart it — this wastes completed work. The supervisor-poll alert 'status unchanged for Nmin' is a false positive for complex implementation tasks. The status is set to 'claude assessing + implementing' at the START of the `timeout 7200 claude -p ...` call and only updates after Claude finishes. Normal complex tasks (multi-file Solidity changes + forge test) take 45-90 minutes. To distinguish a false positive from a real stuck agent: check that the claude PID is alive (`ps -p <PID>`), consuming CPU (>0%), and has active threads (`pstree -p <PID>`). If the process is alive and using CPU, do NOT restart it — this wastes completed work.
### False Positive: 'Waiting for CI + Review' Alert ### False Positive: 'Waiting for CI + Review' Alert
The 'status unchanged for Nmin' alert is also a false positive when status is 'waiting for CI + review on PR #N (round R)'. This is an intentional sleep/poll loop — the agent is waiting for CI to pass and then for review-poll to post a review. CI can take 2040 minutes; review follows. Do NOT restart the agent. Confirm by checking: (1) agent PID is alive, (2) CI commit status via `codeberg_api GET /commits/<sha>/status`, (3) review-poll log shows it will pick up the PR on next cycle. The 'status unchanged for Nmin' alert is also a false positive when status is 'waiting for CI + review on PR #N (round R)'. This is an intentional sleep/poll loop — the agent is waiting for CI to pass and then for review-poll to post a review. CI can take 2040 minutes; review follows. Do NOT restart the agent. Confirm by checking: (1) agent PID is alive, (2) CI commit status via `codeberg_api GET /commits/<sha>/status`, (3) review-poll log shows it will pick up the PR on next cycle.

View file

@ -2,7 +2,7 @@
## Safe Fixes ## Safe Fixes
- Docker cleanup: `sudo docker system prune -f` (keeps images, removes stopped containers + dangling layers) - Docker cleanup: `sudo docker system prune -f` (keeps images, removes stopped containers + dangling layers)
- Truncate factory logs >5MB: `truncate -s 0 <file>` - Truncate supervisor logs >5MB: `truncate -s 0 <file>`
- Remove stale worktrees: check `/tmp/${PROJECT_NAME}-worktree-*`, only if dev-agent not running on them - Remove stale worktrees: check `/tmp/${PROJECT_NAME}-worktree-*`, only if dev-agent not running on them
- Woodpecker log_entries: `DELETE FROM log_entries WHERE id < (SELECT max(id) - 100000 FROM log_entries);` then `VACUUM;` - Woodpecker log_entries: `DELETE FROM log_entries WHERE id < (SELECT max(id) - 100000 FROM log_entries);` then `VACUUM;`
- Node module caches in worktrees: `rm -rf /tmp/${PROJECT_NAME}-worktree-*/node_modules/` - Node module caches in worktrees: `rm -rf /tmp/${PROJECT_NAME}-worktree-*/node_modules/`

View file

@ -34,7 +34,7 @@
## Known Issues ## Known Issues
- Main repo MUST be on $PRIMARY_BRANCH at all times. Dev work happens in worktrees. - Main repo MUST be on $PRIMARY_BRANCH at all times. Dev work happens in worktrees.
- Stale rebases (detached HEAD) break all worktree creation — silent factory stall. - Stale rebases (detached HEAD) break all worktree creation — silent pipeline stall.
- `git worktree add` fails if target directory exists (even empty). Remove first. - `git worktree add` fails if target directory exists (even empty). Remove first.
- Many old branches exist locally (100+). Normal — don't bulk-delete. - Many old branches exist locally (100+). Normal — don't bulk-delete.
@ -47,7 +47,7 @@
## Lessons Learned ## Lessons Learned
- NEVER delete remote branches before confirming merge. Close PR, rebase locally, force-push if needed. - NEVER delete remote branches before confirming merge. Close PR, rebase locally, force-push if needed.
- Stale rebase caused 5h factory stall once (2026-03-11). Auto-heal added to dev-agent. - Stale rebase caused 5h pipeline stall once (2026-03-11). Auto-heal added to dev-agent.
- lint-staged hooks fail when `forge` not in PATH. Use `--no-verify` when committing from scripts. - lint-staged hooks fail when `forge` not in PATH. Use `--no-verify` when committing from scripts.
### PR #608 Post-Mortem (2026-03-12/13) ### PR #608 Post-Mortem (2026-03-12/13)

View file

@ -19,7 +19,7 @@
- **Hallucinated findings** — bot may flag non-issues. This needs Clawy's judgment — escalate. - **Hallucinated findings** — bot may flag non-issues. This needs Clawy's judgment — escalate.
## Monitoring ## Monitoring
- Unreviewed PRs with CI pass for >1h → factory-poll.sh auto-triggers review - Unreviewed PRs with CI pass for >1h → supervisor-poll.sh auto-triggers review
- Review errors should resolve on next poll cycle - Review errors should resolve on next poll cycle
- If same PR fails review 3+ times → likely a prompt issue, escalate - If same PR fails review 3+ times → likely a prompt issue, escalate

View file

@ -1,20 +1,20 @@
#!/usr/bin/env bash #!/usr/bin/env bash
# factory-poll.sh — Factory supervisor: bash checks + claude -p for fixes # supervisor-poll.sh — Supervisor agent: bash checks + claude -p for fixes
# #
# Runs every 10min via cron. Does all health checks in bash (zero tokens). # Runs every 10min via cron. Does all health checks in bash (zero tokens).
# Only invokes claude -p when auto-fix fails or issue is complex. # Only invokes claude -p when auto-fix fails or issue is complex.
# #
# Cron: */10 * * * * /path/to/disinto/factory/factory-poll.sh # Cron: */10 * * * * /path/to/disinto/supervisor/supervisor-poll.sh
# #
# Peek: cat /tmp/factory-status # Peek: cat /tmp/supervisor-status
# Log: tail -f /path/to/disinto/factory/factory.log # Log: tail -f /path/to/disinto/supervisor/supervisor.log
source "$(dirname "$0")/../lib/env.sh" source "$(dirname "$0")/../lib/env.sh"
LOGFILE="${FACTORY_ROOT}/factory/factory.log" LOGFILE="${FACTORY_ROOT}/supervisor/supervisor.log"
STATUSFILE="/tmp/factory-status" STATUSFILE="/tmp/supervisor-status"
LOCKFILE="/tmp/factory-poll.lock" LOCKFILE="/tmp/supervisor-poll.lock"
PROMPT_FILE="${FACTORY_ROOT}/factory/PROMPT.md" PROMPT_FILE="${FACTORY_ROOT}/supervisor/PROMPT.md"
# Prevent overlapping runs # Prevent overlapping runs
if [ -f "$LOCKFILE" ]; then if [ -f "$LOCKFILE" ]; then
@ -32,15 +32,15 @@ flog() {
} }
status() { status() {
printf '[%s] factory: %s\n' "$(date -u '+%Y-%m-%d %H:%M:%S UTC')" "$*" > "$STATUSFILE" printf '[%s] supervisor: %s\n' "$(date -u '+%Y-%m-%d %H:%M:%S UTC')" "$*" > "$STATUSFILE"
flog "$*" flog "$*"
} }
# ── Check for escalation replies from Matrix ────────────────────────────── # ── Check for escalation replies from Matrix ──────────────────────────────
ESCALATION_REPLY="" ESCALATION_REPLY=""
if [ -s /tmp/factory-escalation-reply ]; then if [ -s /tmp/supervisor-escalation-reply ]; then
ESCALATION_REPLY=$(cat /tmp/factory-escalation-reply) ESCALATION_REPLY=$(cat /tmp/supervisor-escalation-reply)
rm -f /tmp/factory-escalation-reply rm -f /tmp/supervisor-escalation-reply
flog "Got escalation reply: $(echo "$ESCALATION_REPLY" | head -1)" flog "Got escalation reply: $(echo "$ESCALATION_REPLY" | head -1)"
fi fi
@ -71,7 +71,7 @@ SWAP_USED_MB=$(free -m | awk '/Swap:/{print $3}')
if [ "${AVAIL_MB:-9999}" -lt 500 ] || { [ "${SWAP_USED_MB:-0}" -gt 3000 ] && [ "${AVAIL_MB:-9999}" -lt 2000 ]; }; then if [ "${AVAIL_MB:-9999}" -lt 500 ] || { [ "${SWAP_USED_MB:-0}" -gt 3000 ] && [ "${AVAIL_MB:-9999}" -lt 2000 ]; }; then
flog "MEMORY CRISIS: avail=${AVAIL_MB}MB swap_used=${SWAP_USED_MB}MB — auto-fixing" flog "MEMORY CRISIS: avail=${AVAIL_MB}MB swap_used=${SWAP_USED_MB}MB — auto-fixing"
# Kill stale factory-spawned claude processes (>3h old) — skip interactive sessions # Kill stale agent-spawned claude processes (>3h old) — skip interactive sessions
STALE_CLAUDES=$(pgrep -f "claude -p" --older 10800 2>/dev/null || true) STALE_CLAUDES=$(pgrep -f "claude -p" --older 10800 2>/dev/null || true)
if [ -n "$STALE_CLAUDES" ]; then if [ -n "$STALE_CLAUDES" ]; then
echo "$STALE_CLAUDES" | xargs kill 2>/dev/null || true echo "$STALE_CLAUDES" | xargs kill 2>/dev/null || true
@ -113,7 +113,7 @@ if [ "${DISK_PERCENT:-0}" -gt 80 ]; then
# Docker cleanup (safe — keeps images) # Docker cleanup (safe — keeps images)
sudo docker system prune -f >/dev/null 2>&1 && fixed "Docker prune" sudo docker system prune -f >/dev/null 2>&1 && fixed "Docker prune"
# Truncate factory logs >10MB # Truncate supervisor logs >10MB
for logfile in "${FACTORY_ROOT}"/{dev,review,factory}/*.log; do for logfile in "${FACTORY_ROOT}"/{dev,review,factory}/*.log; do
if [ -f "$logfile" ]; then if [ -f "$logfile" ]; then
SIZE_KB=$(du -k "$logfile" 2>/dev/null | cut -f1) SIZE_KB=$(du -k "$logfile" 2>/dev/null | cut -f1)
@ -159,7 +159,7 @@ fi
# ============================================================================= # =============================================================================
# P2: FACTORY STOPPED — CI, dev-agent, git # P2: FACTORY STOPPED — CI, dev-agent, git
# ============================================================================= # =============================================================================
status "P2: checking factory" status "P2: checking pipeline"
# CI stuck # CI stuck
STUCK_CI=$(wpdb -c "SELECT count(*) FROM pipelines WHERE repo_id=${WOODPECKER_REPO_ID} AND status='running' AND EXTRACT(EPOCH FROM now() - to_timestamp(started)) > 1200;" 2>/dev/null | xargs || true) STUCK_CI=$(wpdb -c "SELECT count(*) FROM pipelines WHERE repo_id=${WOODPECKER_REPO_ID} AND status='running' AND EXTRACT(EPOCH FROM now() - to_timestamp(started)) > 1200;" 2>/dev/null | xargs || true)
@ -204,7 +204,7 @@ fi
# ============================================================================= # =============================================================================
# P2b: FACTORY STALLED — backlog exists but no agent running # P2b: FACTORY STALLED — backlog exists but no agent running
# ============================================================================= # =============================================================================
status "P2: checking factory stall" status "P2: checking pipeline stall"
BACKLOG_COUNT=$(codeberg_api GET "/issues?state=open&labels=backlog&type=issues&limit=1" 2>/dev/null | jq -r 'length' 2>/dev/null || echo "0") BACKLOG_COUNT=$(codeberg_api GET "/issues?state=open&labels=backlog&type=issues&limit=1" 2>/dev/null | jq -r 'length' 2>/dev/null || echo "0")
IN_PROGRESS=$(codeberg_api GET "/issues?state=open&labels=in-progress&type=issues&limit=1" 2>/dev/null | jq -r 'length' 2>/dev/null || echo "0") IN_PROGRESS=$(codeberg_api GET "/issues?state=open&labels=in-progress&type=issues&limit=1" 2>/dev/null | jq -r 'length' 2>/dev/null || echo "0")
@ -221,7 +221,7 @@ if [ "${BACKLOG_COUNT:-0}" -gt 0 ] && [ "${IN_PROGRESS:-0}" -eq 0 ]; then
IDLE_MIN=$(( (NOW_EPOCH - LAST_LOG_EPOCH) / 60 )) IDLE_MIN=$(( (NOW_EPOCH - LAST_LOG_EPOCH) / 60 ))
if [ "$IDLE_MIN" -gt 20 ]; then if [ "$IDLE_MIN" -gt 20 ]; then
p2 "Factory stalled: ${BACKLOG_COUNT} backlog issue(s), no agent ran for ${IDLE_MIN}min" p2 "Pipeline stalled: ${BACKLOG_COUNT} backlog issue(s), no agent ran for ${IDLE_MIN}min"
fi fi
fi fi
@ -277,7 +277,7 @@ done
# P4: HOUSEKEEPING — stale processes # P4: HOUSEKEEPING — stale processes
# ============================================================================= # =============================================================================
# Check for dev-agent escalations # Check for dev-agent escalations
ESCALATION_FILE="${FACTORY_ROOT}/factory/escalations.jsonl" ESCALATION_FILE="${FACTORY_ROOT}/supervisor/escalations.jsonl"
if [ -s "$ESCALATION_FILE" ]; then if [ -s "$ESCALATION_FILE" ]; then
ESCALATION_COUNT=$(wc -l < "$ESCALATION_FILE") ESCALATION_COUNT=$(wc -l < "$ESCALATION_FILE")
p3 "Dev-agent escalated ${ESCALATION_COUNT} issue(s) — see ${ESCALATION_FILE}" p3 "Dev-agent escalated ${ESCALATION_COUNT} issue(s) — see ${ESCALATION_FILE}"
@ -285,7 +285,7 @@ fi
status "P4: housekeeping" status "P4: housekeeping"
# Stale factory-spawned claude processes (>3h, not caught by P0) — skip interactive sessions # Stale agent-spawned claude processes (>3h, not caught by P0) — skip interactive sessions
STALE_CLAUDES=$(pgrep -f "claude -p" --older 10800 2>/dev/null || true) STALE_CLAUDES=$(pgrep -f "claude -p" --older 10800 2>/dev/null || true)
if [ -n "$STALE_CLAUDES" ]; then if [ -n "$STALE_CLAUDES" ]; then
echo "$STALE_CLAUDES" | xargs kill 2>/dev/null || true echo "$STALE_CLAUDES" | xargs kill 2>/dev/null || true
@ -308,7 +308,7 @@ for wt in /tmp/${PROJECT_NAME}-worktree-* /tmp/${PROJECT_NAME}-review-*; do
done done
git -C "$PROJECT_REPO_ROOT" worktree prune 2>/dev/null || true git -C "$PROJECT_REPO_ROOT" worktree prune 2>/dev/null || true
# Rotate factory log if >5MB # Rotate supervisor log if >5MB
for logfile in "${FACTORY_ROOT}"/{dev,review,factory}/*.log; do for logfile in "${FACTORY_ROOT}"/{dev,review,factory}/*.log; do
if [ -f "$logfile" ]; then if [ -f "$logfile" ]; then
SIZE_KB=$(du -k "$logfile" 2>/dev/null | cut -f1) SIZE_KB=$(du -k "$logfile" 2>/dev/null | cut -f1)
@ -329,12 +329,12 @@ if [ -n "$ALL_ALERTS" ]; then
ALERT_TEXT=$(echo -e "$ALL_ALERTS") ALERT_TEXT=$(echo -e "$ALL_ALERTS")
# Notify Matrix # Notify Matrix
matrix_send "supervisor" "⚠️ Factory alerts: matrix_send "supervisor" "⚠️ Supervisor alerts:
${ALERT_TEXT}" 2>/dev/null || true ${ALERT_TEXT}" 2>/dev/null || true
flog "Invoking claude -p for alerts" flog "Invoking claude -p for alerts"
CLAUDE_PROMPT="$(cat "$PROMPT_FILE" 2>/dev/null || echo "You are a factory supervisor. Fix the issue below.") CLAUDE_PROMPT="$(cat "$PROMPT_FILE" 2>/dev/null || echo "You are a supervisor agent. Fix the issue below.")
## Current Alerts ## Current Alerts
${ALERT_TEXT} ${ALERT_TEXT}

View file

@ -2,15 +2,15 @@
# update-prompt.sh — Append a lesson to a best-practices file # update-prompt.sh — Append a lesson to a best-practices file
# #
# Usage: # Usage:
# ./factory/update-prompt.sh "best-practices/memory.md" "### Title\nBody text" # ./supervisor/update-prompt.sh "best-practices/memory.md" "### Title\nBody text"
# ./factory/update-prompt.sh --from-file "best-practices/memory.md" /tmp/lesson.md # ./supervisor/update-prompt.sh --from-file "best-practices/memory.md" /tmp/lesson.md
# #
# Called by claude -p when it learns something during a fix. # Called by claude -p when it learns something during a fix.
# Commits and pushes the update to the disinto repo. # Commits and pushes the update to the disinto repo.
source "$(dirname "$0")/../lib/env.sh" source "$(dirname "$0")/../lib/env.sh"
TARGET_FILE="${FACTORY_ROOT}/factory/$1" TARGET_FILE="${FACTORY_ROOT}/supervisor/$1"
shift shift
if [ "$1" = "--from-file" ] && [ -f "$2" ]; then if [ "$1" = "--from-file" ] && [ -f "$2" ]; then
@ -40,8 +40,8 @@ else
fi fi
cd "$FACTORY_ROOT" cd "$FACTORY_ROOT"
git add "factory/$1" 2>/dev/null || git add "$TARGET_FILE" git add "supervisor/$1" 2>/dev/null || git add "$TARGET_FILE"
git commit -m "factory: learned — $(echo "$LESSON" | head -1 | sed 's/^#* *//')" --no-verify 2>/dev/null git commit -m "supervisor: learned — $(echo "$LESSON" | head -1 | sed 's/^#* *//')" --no-verify 2>/dev/null
git push origin main 2>/dev/null git push origin main 2>/dev/null
log "Updated $(basename "$TARGET_FILE") with new lesson" log "Updated $(basename "$TARGET_FILE") with new lesson"