refactor: rename factory/ → supervisor/, factory-poll → supervisor-poll

The supervisor agent was confusingly named "factory" (same as the project).
Rename directory, script, log, lock, status, and escalation files. Update
all references across scripts and docs.

FACTORY_ROOT env var unchanged (refers to project root, not agent).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
parent 8d73c2f8f9
commit 77cb4c4643
15 changed files with 68 additions and 68 deletions
14
BOOTSTRAP.md
@@ -55,7 +55,7 @@ CLAUDE_TIMEOUT=7200 # seconds per Claude invocation

 ### Required: CI pipeline

-The repo needs at least one Woodpecker pipeline. Dark-factory monitors CI status to decide when a PR is ready for review and when it can merge.
+The repo needs at least one Woodpecker pipeline. Disinto monitors CI status to decide when a PR is ready for review and when it can merge.

 ### Required: `CLAUDE.md`

@@ -155,7 +155,7 @@ Add (adjust paths):
 FACTORY_ROOT=/home/you/disinto

 # Supervisor — health checks, auto-healing (every 10 min)
-0,10,20,30,40,50 * * * * $FACTORY_ROOT/factory/factory-poll.sh
+0,10,20,30,40,50 * * * * $FACTORY_ROOT/supervisor/supervisor-poll.sh

 # Review agent — find unreviewed PRs (every 10 min, offset +3)
 3,13,23,33,43,53 * * * * $FACTORY_ROOT/review/review-poll.sh
|
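The staggered minute lists in the crontab entries above follow a simple pattern: a fixed 10-minute step plus a per-agent offset. The following is a minimal sketch of that pattern, not code from the repository; `cron_minutes` is a hypothetical helper name introduced only for illustration.

```shell
# cron_minutes OFFSET [STEP]
# Print the cron minute field for a job that runs every STEP minutes
# (default 10), starting OFFSET minutes past the hour.
# Hypothetical helper, not part of the repository.
cron_minutes() {
  offset=$1
  step=${2:-10}
  out=""
  m=$offset
  while [ "$m" -lt 60 ]; do
    # Append with a comma separator once the list is non-empty.
    out="${out:+$out,}$m"
    m=$((m + step))
  done
  printf '%s\n' "$out"
}

cron_minutes 0   # supervisor schedule
cron_minutes 3   # review schedule (offset +3)
```

Writing the offsets out explicitly, as the crontab does, keeps the entries portable to cron implementations that may not accept range/step expressions such as `3-59/10`.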
@@ -176,7 +176,7 @@ The 3-minute offsets prevent agents from competing for resources.

 ```bash
 # Should complete with "all clear" (no problems to fix)
-bash factory/factory-poll.sh
+bash supervisor/supervisor-poll.sh

 # Should list backlog issues (or "no backlog issues")
 bash dev/dev-poll.sh
@@ -188,7 +188,7 @@ bash review/review-poll.sh
 Check logs after a few cycles:

 ```bash
-tail -30 factory/factory.log
+tail -30 supervisor/supervisor.log
 tail -30 dev/dev-agent.log
 tail -30 review/review.log
 ```
@@ -203,7 +203,7 @@ If you want real-time notifications and human-in-the-loop escalation:
 sudo cp lib/matrix_listener.service /etc/systemd/system/
 sudo systemctl enable --now matrix_listener
 ```
-3. The factory and gardener will post status updates and escalation threads to the configured room. Reply in-thread to answer escalations.
+3. The supervisor and gardener will post status updates and escalation threads to the configured room. Reply in-thread to answer escalations.

 ## Lifecycle

@@ -219,7 +219,7 @@ You write issues (with backlog label)
 → merge, close issue, clean up

 Meanwhile:
-factory-poll monitors health, kills stale processes, manages resources
+supervisor-poll monitors health, kills stale processes, manages resources
 gardener grooms backlog: closes duplicates, promotes tech-debt, escalates ambiguity
 planner rebuilds AGENTS.md from git history, gap-analyses against VISION.md
 ```
@@ -233,4 +233,4 @@ Meanwhile:
 | CI stuck | `bash lib/ci-debug.sh` — check Woodpecker. Rate-limited? (exit 128 = wait 15 min) |
 | Claude not found | `which claude` — must be in PATH. Check `lib/env.sh` adds `~/.local/bin`. |
 | Merge fails | Branch protection misconfigured? Review bot needs write access to the repo. |
-| Memory issues | Factory auto-heals at <500 MB free. Check `factory/factory.log` for P0 alerts. |
+| Memory issues | Supervisor auto-heals at <500 MB free. Check `supervisor/supervisor.log` for P0 alerts. |
16
README.md
@@ -9,7 +9,7 @@ Point it at a Codeberg repo with a Woodpecker CI pipeline and it will pick up is
 ## Architecture

 ```
-cron (*/10) ──→ factory-poll.sh ← supervisor (bash checks, zero tokens)
+cron (*/10) ──→ supervisor-poll.sh ← supervisor (bash checks, zero tokens)
 ├── all clear? → exit 0
 └── problem? → claude -p (diagnose, fix, or escalate)

@@ -33,9 +33,9 @@ all agents ──→ matrix_send() ← status updates, escalations, merge no
 **Required:**

 - [Claude CLI](https://docs.anthropic.com/en/docs/claude-cli) — `claude` in PATH, authenticated
-- [Codeberg](https://codeberg.org/) account with an API token — the factory reads issues, opens PRs, posts comments, and merges via the Codeberg API
+- [Codeberg](https://codeberg.org/) account with an API token — disinto reads issues, opens PRs, posts comments, and merges via the Codeberg API
 - A second Codeberg account for the review bot — reviews posted under a separate identity so the dev-agent doesn't review its own PRs (`REVIEW_BOT_TOKEN`)
-- [Woodpecker CI](https://woodpecker-ci.org/) — local instance connected to your Codeberg repo; the factory monitors pipelines, retries failures, and queries the Woodpecker Postgres DB directly
+- [Woodpecker CI](https://woodpecker-ci.org/) — local instance connected to your Codeberg repo; disinto monitors pipelines, retries failures, and queries the Woodpecker Postgres DB directly
 - PostgreSQL client (`psql`) — for Woodpecker DB queries (pipeline status, build counts)
 - `jq`, `curl`, `git`

@@ -84,13 +84,13 @@ CLAUDE_TIMEOUT=7200 # max seconds per Claude invocation (default: 2h)
 # 3. Install cron (staggered to avoid overlap)
 crontab -e
 # Add:
-# 0,10,20,30,40,50 * * * * /path/to/disinto/factory/factory-poll.sh
+# 0,10,20,30,40,50 * * * * /path/to/disinto/supervisor/supervisor-poll.sh
 # 3,13,23,33,43,53 * * * * /path/to/disinto/review/review-poll.sh
 # 6,16,26,36,46,56 * * * * /path/to/disinto/dev/dev-poll.sh
 # 15 8 * * * /path/to/disinto/gardener/gardener-poll.sh

 # 4. Verify
-bash factory/factory-poll.sh # should log "all clear"
+bash supervisor/supervisor-poll.sh # should log "all clear"
 ```

 ## Directory Structure
@@ -113,8 +113,8 @@ disinto/
 ├── gardener/
 │ ├── gardener-poll.sh # Cron entry: backlog grooming
 │ └── best-practices.md # Gardener knowledge base
-└── factory/
-├── factory-poll.sh # Supervisor: health checks + claude -p
+└── supervisor/
+├── supervisor-poll.sh # Supervisor: health checks + claude -p
 ├── PROMPT.md # Supervisor's system prompt
 ├── update-prompt.sh # Self-learning: append to best-practices
 └── best-practices/ # Progressive disclosure knowledge base
@@ -131,7 +131,7 @@ disinto/

 | Agent | Trigger | Job |
 |-------|---------|-----|
-| **Factory** (supervisor) | Every 10 min | Health checks (RAM, disk, CI, git). Calls Claude only when something is broken. Self-improving via `best-practices/`. |
+| **Supervisor** | Every 10 min | Health checks (RAM, disk, CI, git). Calls Claude only when something is broken. Self-improving via `best-practices/`. |
 | **Dev** | Every 10 min | Picks up `backlog`-labeled issues, creates a branch, implements, opens a PR, monitors CI, responds to review, merges. |
 | **Review** | Every 10 min | Finds PRs without review, runs Claude-powered code review, approves or requests changes. |
 | **Gardener** | Daily | Grooms the issue backlog: detects duplicates, promotes `tech-debt` to `backlog`, closes stale issues, escalates ambiguous items. |
@@ -1106,9 +1106,9 @@ while [ "$REVIEW_ROUND" -lt "$MAX_REVIEW_ROUNDS" ]; do
 CI_FIX_COUNT=$(( ${CI_FIX_COUNT:-0} + 1 ))
 if [ "$CI_FIX_COUNT" -gt 2 ]; then
 log "CI failure not recoverable after ${CI_FIX_COUNT} fix attempts"
-# Escalate to supervisor — write marker for factory-poll.sh to pick up
+# Escalate to supervisor — write marker for supervisor-poll.sh to pick up
 echo "{\"issue\":${ISSUE},\"pr\":${PR_NUMBER},\"reason\":\"ci_exhausted\",\"step\":\"${FAILED_STEP:-unknown}\",\"attempts\":${CI_FIX_COUNT},\"ts\":\"$(date -u +%Y-%m-%dT%H:%M:%SZ)\"}" \
->> "${FACTORY_ROOT}/factory/escalations.jsonl"
+>> "${FACTORY_ROOT}/supervisor/escalations.jsonl"
 log "escalated to supervisor via escalations.jsonl"
 break
 fi
|
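The escalation marker written in the hunk above is one JSON object per line, appended to `escalations.jsonl`. As an illustration only, the same record can be built with `printf`, which avoids the backslash-escaped quotes. The helper names `escalation_record` and `count_escalations` are hypothetical, and the sketch assumes that the reason and step values contain no double quotes or backslashes, since nothing here escapes them.

```shell
# escalation_record ISSUE PR REASON STEP ATTEMPTS
# Print one JSON line with the same fields as the marker in the diff.
# Hypothetical helper; values are inserted verbatim and NOT JSON-escaped.
escalation_record() {
  printf '{"issue":%s,"pr":%s,"reason":"%s","step":"%s","attempts":%s,"ts":"%s"}\n' \
    "$1" "$2" "$3" "$4" "$5" "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
}

# count_escalations FILE
# Print the number of pending records (one per line), or 0 if the file
# is missing or empty. Mirrors the `-s` test and `wc -l` count used by
# the supervisor side of the diff.
count_escalations() {
  if [ -s "$1" ]; then
    wc -l < "$1" | tr -d ' '
  else
    echo 0
  fi
}
```

A caller would append with `escalation_record 42 57 ci_exhausted lint 3 >> "$file"`, where the issue and PR numbers are made up for the example.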
@@ -1,5 +1,5 @@
 #!/usr/bin/env bash
-# dev-poll.sh — Pull-based factory: find the next ready issue and start dev-agent
+# dev-poll.sh — Pull-based scheduler: find the next ready issue and start dev-agent
 #
 # Pull system: issues labeled "backlog" are candidates. An issue is READY when
 # ALL its dependency issues are closed (and their PRs merged).
@@ -104,7 +104,7 @@ dep_is_merged() {
 return 1
 fi

-# Issue closed = dep satisfied. The factory only closes issues after
+# Issue closed = dep satisfied. The scheduler only closes issues after
 # merging, so closed state is trustworthy. No need to hunt for the
 # specific PR — that was over-engineering that caused false negatives.
 return 0
@@ -1,11 +1,11 @@
 #!/usr/bin/env bash
 # matrix_listener.sh — Long-poll Matrix sync daemon
 #
-# Listens for replies in the factory Matrix room and dispatches them
+# Listens for replies in the Matrix coordination room and dispatches them
 # to the appropriate agent via well-known files.
 #
 # Dispatch:
-# Thread reply to [supervisor] message → /tmp/factory-escalation-reply
+# Thread reply to [supervisor] message → /tmp/supervisor-escalation-reply
 # Thread reply to [gardener] message → /tmp/gardener-escalation-reply
 #
 # Run as systemd service (see matrix_listener.service) or manually:
@@ -18,7 +18,7 @@ source "$(dirname "$0")/../lib/env.sh"

 SINCE_FILE="/tmp/matrix-listener-since"
 THREAD_MAP="${MATRIX_THREAD_MAP:-/tmp/matrix-thread-map}"
-LOGFILE="${FACTORY_ROOT}/factory/matrix-listener.log"
+LOGFILE="${FACTORY_ROOT}/supervisor/matrix-listener.log"
 SYNC_TIMEOUT=30000 # 30s long-poll
 BACKOFF=5
 MAX_BACKOFF=60
@@ -133,7 +133,7 @@ while true; do

 case "$AGENT" in
 supervisor)
-printf '%s\t%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$SENDER" "$BODY" >> /tmp/factory-escalation-reply
+printf '%s\t%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$SENDER" "$BODY" >> /tmp/supervisor-escalation-reply
 # Acknowledge
 matrix_send "supervisor" "✓ received, will act on next poll" "$THREAD_ROOT" >/dev/null 2>&1 || true
 ;;
|
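The listener and the poll script communicate through a well-known file: the listener appends tab-separated records (timestamp, sender, body) and the next poll reads the file and deletes it. The sketch below shows that hand-off in isolation. The function names `record_reply` and `consume_replies` and the `REPLY_FILE` variable are assumptions made for this example; the repository's scripts inline the same steps.

```shell
# Path of the hand-off file; defaults to the renamed path from the diff.
REPLY_FILE=${REPLY_FILE:-/tmp/supervisor-escalation-reply}

# record_reply SENDER BODY
# Listener side: append one tab-separated record (timestamp, sender, body).
# Hypothetical helper name.
record_reply() {
  printf '%s\t%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" >> "$REPLY_FILE"
}

# consume_replies
# Poll side: print all pending replies, then remove the file.
# Prints nothing when the file is missing or empty.
consume_replies() {
  if [ -s "$REPLY_FILE" ]; then
    cat "$REPLY_FILE"
    rm -f "$REPLY_FILE"
  fi
}
```

Note that a reply appended between the `cat` and the `rm` would be lost. If that matters, one possible variation is to `mv` the file to a private name first and read the moved copy.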
@@ -1,7 +1,7 @@
-# Factory Supervisor
+# Supervisor Agent

-You are the factory supervisor for `$CODEBERG_REPO`. You were called because
-`factory-poll.sh` detected an issue it couldn't auto-fix.
+You are the supervisor agent for `$CODEBERG_REPO`. You were called because
+`supervisor-poll.sh` detected an issue it couldn't auto-fix.

 ## Priority Order

@@ -16,13 +16,13 @@ You are the factory supervisor for `$CODEBERG_REPO`. You were called because
 Fix the issue yourself. You have full shell access and `--dangerously-skip-permissions`.

 Before acting, read the relevant best-practices file:
-- Memory issues → `cat ${FACTORY_ROOT}/factory/best-practices/memory.md`
-- Disk issues → `cat ${FACTORY_ROOT}/factory/best-practices/disk.md`
-- CI issues → `cat ${FACTORY_ROOT}/factory/best-practices/ci.md`
-- Codeberg / rate limits → `cat ${FACTORY_ROOT}/factory/best-practices/codeberg.md`
-- Dev-agent issues → `cat ${FACTORY_ROOT}/factory/best-practices/dev-agent.md`
-- Review-agent issues → `cat ${FACTORY_ROOT}/factory/best-practices/review-agent.md`
-- Git issues → `cat ${FACTORY_ROOT}/factory/best-practices/git.md`
+- Memory issues → `cat ${FACTORY_ROOT}/supervisor/best-practices/memory.md`
+- Disk issues → `cat ${FACTORY_ROOT}/supervisor/best-practices/disk.md`
+- CI issues → `cat ${FACTORY_ROOT}/supervisor/best-practices/ci.md`
+- Codeberg / rate limits → `cat ${FACTORY_ROOT}/supervisor/best-practices/codeberg.md`
+- Dev-agent issues → `cat ${FACTORY_ROOT}/supervisor/best-practices/dev-agent.md`
+- Review-agent issues → `cat ${FACTORY_ROOT}/supervisor/best-practices/review-agent.md`
+- Git issues → `cat ${FACTORY_ROOT}/supervisor/best-practices/git.md`

 ## Credentials & API Access

@@ -66,6 +66,6 @@ ESCALATE: <what's wrong>

 If you discover something new, append it to the relevant best-practices file:
 ```bash
-bash ${FACTORY_ROOT}/factory/update-prompt.sh "best-practices/<file>.md" "### Lesson title
+bash ${FACTORY_ROOT}/supervisor/update-prompt.sh "best-practices/<file>.md" "### Lesson title
 Description of what you learned."
 ```
@@ -22,7 +22,7 @@ cd <worktree> && git commit --allow-empty -m "ci: retrigger" --no-verify && git
 ```

 ### Prevention
-- The factory runs 3 agents staggered by 3 minutes. During heavy development, many PRs trigger CI simultaneously.
+- The system runs 3 agents staggered by 3 minutes. During heavy development, many PRs trigger CI simultaneously.
 - One pipeline at a time is ideal on this VPS (resource + rate limit reasons).
 - If >3 pipelines are pending/running, do NOT create more work.

@@ -44,12 +44,12 @@ DO NOT try to find the specific PR that closed an issue. This is over-engineerin
 - Codeberg shares issue/PR numbering — no guaranteed relationship
 - PRs don't always mention the issue number in title/body
 - Searching last N closed PRs misses older merges
-- The factory itself closes issues after merging, so closed = merged
+- The dev-agent closes issues after merging, so closed = merged

 The only check needed: `issue.state == "closed"`.

 ### False Positive: Status Unchanged Alert
-The factory-poll alert 'status unchanged for Nmin' is a false positive for complex implementation tasks. The status is set to 'claude assessing + implementing' at the START of the `timeout 7200 claude -p ...` call and only updates after Claude finishes. Normal complex tasks (multi-file Solidity changes + forge test) take 45-90 minutes. To distinguish a false positive from a real stuck agent: check that the claude PID is alive (`ps -p <PID>`), consuming CPU (>0%), and has active threads (`pstree -p <PID>`). If the process is alive and using CPU, do NOT restart it — this wastes completed work.
+The supervisor-poll alert 'status unchanged for Nmin' is a false positive for complex implementation tasks. The status is set to 'claude assessing + implementing' at the START of the `timeout 7200 claude -p ...` call and only updates after Claude finishes. Normal complex tasks (multi-file Solidity changes + forge test) take 45-90 minutes. To distinguish a false positive from a real stuck agent: check that the claude PID is alive (`ps -p <PID>`), consuming CPU (>0%), and has active threads (`pstree -p <PID>`). If the process is alive and using CPU, do NOT restart it — this wastes completed work.

 ### False Positive: 'Waiting for CI + Review' Alert
 The 'status unchanged for Nmin' alert is also a false positive when status is 'waiting for CI + review on PR #N (round R)'. This is an intentional sleep/poll loop — the agent is waiting for CI to pass and then for review-poll to post a review. CI can take 20–40 minutes; review follows. Do NOT restart the agent. Confirm by checking: (1) agent PID is alive, (2) CI commit status via `codeberg_api GET /commits/<sha>/status`, (3) review-poll log shows it will pick up the PR on next cycle.
|
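The liveness checks described in the false-positive notes above (PID alive, CPU in use) can be wrapped in two small helpers. This is a sketch under stated assumptions rather than code from the repository: the names `agent_is_alive` and `agent_cpu_percent` are hypothetical, and the `-o %cpu=` output format is assumed to be supported by the local `ps` (it is on procps and BSD `ps`, but not on every minimal implementation).

```shell
# agent_is_alive PID
# Succeed if a process with this PID exists and the caller may signal it.
# Signal 0 performs the existence/permission check without sending anything.
agent_is_alive() {
  kill -0 "$1" 2>/dev/null
}

# agent_cpu_percent PID
# Print the %CPU figure reported by ps, or nothing if the PID is gone.
# Assumes a ps that understands `-o %cpu=`.
agent_cpu_percent() {
  ps -p "$1" -o %cpu= 2>/dev/null | tr -d ' '
}
```

Two caveats apply. `kill -0` also fails for a live process owned by another user, so the check should run as the same user as the agent. On procps, `%cpu` is averaged over the life of the process, so a long-running agent that has only recently gone idle can still show a non-zero value; comparing two samples of accumulated CPU time is a stricter test.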
@@ -2,7 +2,7 @@

 ## Safe Fixes
 - Docker cleanup: `sudo docker system prune -f` (keeps images, removes stopped containers + dangling layers)
-- Truncate factory logs >5MB: `truncate -s 0 <file>`
+- Truncate supervisor logs >5MB: `truncate -s 0 <file>`
 - Remove stale worktrees: check `/tmp/${PROJECT_NAME}-worktree-*`, only if dev-agent not running on them
 - Woodpecker log_entries: `DELETE FROM log_entries WHERE id < (SELECT max(id) - 100000 FROM log_entries);` then `VACUUM;`
 - Node module caches in worktrees: `rm -rf /tmp/${PROJECT_NAME}-worktree-*/node_modules/`
@@ -34,7 +34,7 @@

 ## Known Issues
 - Main repo MUST be on $PRIMARY_BRANCH at all times. Dev work happens in worktrees.
-- Stale rebases (detached HEAD) break all worktree creation — silent factory stall.
+- Stale rebases (detached HEAD) break all worktree creation — silent pipeline stall.
 - `git worktree add` fails if target directory exists (even empty). Remove first.
 - Many old branches exist locally (100+). Normal — don't bulk-delete.

@@ -47,7 +47,7 @@

 ## Lessons Learned
 - NEVER delete remote branches before confirming merge. Close PR, rebase locally, force-push if needed.
-- Stale rebase caused 5h factory stall once (2026-03-11). Auto-heal added to dev-agent.
+- Stale rebase caused 5h pipeline stall once (2026-03-11). Auto-heal added to dev-agent.
 - lint-staged hooks fail when `forge` not in PATH. Use `--no-verify` when committing from scripts.

 ### PR #608 Post-Mortem (2026-03-12/13)
@@ -19,7 +19,7 @@
 - **Hallucinated findings** — bot may flag non-issues. This needs Clawy's judgment — escalate.

 ## Monitoring
-- Unreviewed PRs with CI pass for >1h → factory-poll.sh auto-triggers review
+- Unreviewed PRs with CI pass for >1h → supervisor-poll.sh auto-triggers review
 - Review errors should resolve on next poll cycle
 - If same PR fails review 3+ times → likely a prompt issue, escalate

@@ -1,20 +1,20 @@
 #!/usr/bin/env bash
-# factory-poll.sh — Factory supervisor: bash checks + claude -p for fixes
+# supervisor-poll.sh — Supervisor agent: bash checks + claude -p for fixes
 #
 # Runs every 10min via cron. Does all health checks in bash (zero tokens).
 # Only invokes claude -p when auto-fix fails or issue is complex.
 #
-# Cron: */10 * * * * /path/to/disinto/factory/factory-poll.sh
+# Cron: */10 * * * * /path/to/disinto/supervisor/supervisor-poll.sh
 #
-# Peek: cat /tmp/factory-status
-# Log: tail -f /path/to/disinto/factory/factory.log
+# Peek: cat /tmp/supervisor-status
+# Log: tail -f /path/to/disinto/supervisor/supervisor.log

 source "$(dirname "$0")/../lib/env.sh"

-LOGFILE="${FACTORY_ROOT}/factory/factory.log"
-STATUSFILE="/tmp/factory-status"
-LOCKFILE="/tmp/factory-poll.lock"
-PROMPT_FILE="${FACTORY_ROOT}/factory/PROMPT.md"
+LOGFILE="${FACTORY_ROOT}/supervisor/supervisor.log"
+STATUSFILE="/tmp/supervisor-status"
+LOCKFILE="/tmp/supervisor-poll.lock"
+PROMPT_FILE="${FACTORY_ROOT}/supervisor/PROMPT.md"

 # Prevent overlapping runs
 if [ -f "$LOCKFILE" ]; then
@@ -32,15 +32,15 @@ flog() {
 }

 status() {
-printf '[%s] factory: %s\n' "$(date -u '+%Y-%m-%d %H:%M:%S UTC')" "$*" > "$STATUSFILE"
+printf '[%s] supervisor: %s\n' "$(date -u '+%Y-%m-%d %H:%M:%S UTC')" "$*" > "$STATUSFILE"
 flog "$*"
 }

 # ── Check for escalation replies from Matrix ──────────────────────────────
 ESCALATION_REPLY=""
-if [ -s /tmp/factory-escalation-reply ]; then
-ESCALATION_REPLY=$(cat /tmp/factory-escalation-reply)
-rm -f /tmp/factory-escalation-reply
+if [ -s /tmp/supervisor-escalation-reply ]; then
+ESCALATION_REPLY=$(cat /tmp/supervisor-escalation-reply)
+rm -f /tmp/supervisor-escalation-reply
 flog "Got escalation reply: $(echo "$ESCALATION_REPLY" | head -1)"
 fi

@@ -71,7 +71,7 @@ SWAP_USED_MB=$(free -m | awk '/Swap:/{print $3}')
 if [ "${AVAIL_MB:-9999}" -lt 500 ] || { [ "${SWAP_USED_MB:-0}" -gt 3000 ] && [ "${AVAIL_MB:-9999}" -lt 2000 ]; }; then
 flog "MEMORY CRISIS: avail=${AVAIL_MB}MB swap_used=${SWAP_USED_MB}MB — auto-fixing"

-# Kill stale factory-spawned claude processes (>3h old) — skip interactive sessions
+# Kill stale agent-spawned claude processes (>3h old) — skip interactive sessions
 STALE_CLAUDES=$(pgrep -f "claude -p" --older 10800 2>/dev/null || true)
 if [ -n "$STALE_CLAUDES" ]; then
 echo "$STALE_CLAUDES" | xargs kill 2>/dev/null || true
|
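The memory-crisis condition in the hunk above combines two thresholds: less than 500 MB available, or more than 3000 MB of swap in use while less than 2000 MB is available. Isolating it as a predicate makes the thresholds easy to exercise with sample values. The function name `memory_crisis` is hypothetical; the numbers and the fallback defaults are taken from the diff.

```shell
# memory_crisis AVAIL_MB SWAP_USED_MB
# Succeed when the supervisor's P0 memory condition holds:
#   available < 500 MB, or (swap used > 3000 MB and available < 2000 MB).
# Missing arguments fall back to the same defaults the script uses.
memory_crisis() {
  avail_mb=${1:-9999}
  swap_used_mb=${2:-0}
  [ "$avail_mb" -lt 500 ] || { [ "$swap_used_mb" -gt 3000 ] && [ "$avail_mb" -lt 2000 ]; }
}

# Example: 1500 MB available with 3500 MB of swap in use counts as a crisis,
# while the same swap usage with 2500 MB available does not.
if memory_crisis 1500 3500; then
  echo "would trigger auto-fix"
fi
```

The swap clause catches a machine that is thrashing before available memory reaches the hard 500 MB floor.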
@@ -113,7 +113,7 @@ if [ "${DISK_PERCENT:-0}" -gt 80 ]; then
 # Docker cleanup (safe — keeps images)
 sudo docker system prune -f >/dev/null 2>&1 && fixed "Docker prune"

-# Truncate factory logs >10MB
+# Truncate supervisor logs >10MB
 for logfile in "${FACTORY_ROOT}"/{dev,review,factory}/*.log; do
 if [ -f "$logfile" ]; then
 SIZE_KB=$(du -k "$logfile" 2>/dev/null | cut -f1)
@@ -159,7 +159,7 @@ fi
 # =============================================================================
 # P2: FACTORY STOPPED — CI, dev-agent, git
 # =============================================================================
-status "P2: checking factory"
+status "P2: checking pipeline"

 # CI stuck
 STUCK_CI=$(wpdb -c "SELECT count(*) FROM pipelines WHERE repo_id=${WOODPECKER_REPO_ID} AND status='running' AND EXTRACT(EPOCH FROM now() - to_timestamp(started)) > 1200;" 2>/dev/null | xargs || true)
@@ -204,7 +204,7 @@ fi
 # =============================================================================
 # P2b: FACTORY STALLED — backlog exists but no agent running
 # =============================================================================
-status "P2: checking factory stall"
+status "P2: checking pipeline stall"

 BACKLOG_COUNT=$(codeberg_api GET "/issues?state=open&labels=backlog&type=issues&limit=1" 2>/dev/null | jq -r 'length' 2>/dev/null || echo "0")
 IN_PROGRESS=$(codeberg_api GET "/issues?state=open&labels=in-progress&type=issues&limit=1" 2>/dev/null | jq -r 'length' 2>/dev/null || echo "0")
@@ -221,7 +221,7 @@ if [ "${BACKLOG_COUNT:-0}" -gt 0 ] && [ "${IN_PROGRESS:-0}" -eq 0 ]; then
 IDLE_MIN=$(( (NOW_EPOCH - LAST_LOG_EPOCH) / 60 ))

 if [ "$IDLE_MIN" -gt 20 ]; then
-p2 "Factory stalled: ${BACKLOG_COUNT} backlog issue(s), no agent ran for ${IDLE_MIN}min"
+p2 "Pipeline stalled: ${BACKLOG_COUNT} backlog issue(s), no agent ran for ${IDLE_MIN}min"
 fi
 fi

@ -277,7 +277,7 @@ done
|
||||||
# P4: HOUSEKEEPING — stale processes
|
# P4: HOUSEKEEPING — stale processes
|
||||||
# =============================================================================
|
# =============================================================================
|
||||||
# Check for dev-agent escalations
|
# Check for dev-agent escalations
|
||||||
ESCALATION_FILE="${FACTORY_ROOT}/factory/escalations.jsonl"
|
ESCALATION_FILE="${FACTORY_ROOT}/supervisor/escalations.jsonl"
|
||||||
if [ -s "$ESCALATION_FILE" ]; then
|
if [ -s "$ESCALATION_FILE" ]; then
|
||||||
ESCALATION_COUNT=$(wc -l < "$ESCALATION_FILE")
|
ESCALATION_COUNT=$(wc -l < "$ESCALATION_FILE")
|
||||||
p3 "Dev-agent escalated ${ESCALATION_COUNT} issue(s) — see ${ESCALATION_FILE}"
|
p3 "Dev-agent escalated ${ESCALATION_COUNT} issue(s) — see ${ESCALATION_FILE}"
|
||||||
|
|
@ -285,7 +285,7 @@ fi
|
||||||
|
|
||||||
status "P4: housekeeping"
|
status "P4: housekeeping"
|
||||||
|
|
||||||
# Stale factory-spawned claude processes (>3h, not caught by P0) — skip interactive sessions
|
# Stale agent-spawned claude processes (>3h, not caught by P0) — skip interactive sessions
|
||||||
STALE_CLAUDES=$(pgrep -f "claude -p" --older 10800 2>/dev/null || true)
|
STALE_CLAUDES=$(pgrep -f "claude -p" --older 10800 2>/dev/null || true)
|
||||||
if [ -n "$STALE_CLAUDES" ]; then
|
if [ -n "$STALE_CLAUDES" ]; then
|
||||||
echo "$STALE_CLAUDES" | xargs kill 2>/dev/null || true
|
echo "$STALE_CLAUDES" | xargs kill 2>/dev/null || true
|
||||||
|
|
@ -308,7 +308,7 @@ for wt in /tmp/${PROJECT_NAME}-worktree-* /tmp/${PROJECT_NAME}-review-*; do
|
||||||
done
|
done
|
||||||
git -C "$PROJECT_REPO_ROOT" worktree prune 2>/dev/null || true
|
git -C "$PROJECT_REPO_ROOT" worktree prune 2>/dev/null || true
|
||||||
|
|
||||||
# Rotate factory log if >5MB
|
# Rotate supervisor log if >5MB
|
||||||
for logfile in "${FACTORY_ROOT}"/{dev,review,factory}/*.log; do
|
for logfile in "${FACTORY_ROOT}"/{dev,review,factory}/*.log; do
|
||||||
if [ -f "$logfile" ]; then
|
if [ -f "$logfile" ]; then
|
||||||
SIZE_KB=$(du -k "$logfile" 2>/dev/null | cut -f1)
|
SIZE_KB=$(du -k "$logfile" 2>/dev/null | cut -f1)
|
||||||
|
|
@ -329,12 +329,12 @@ if [ -n "$ALL_ALERTS" ]; then
|
||||||
ALERT_TEXT=$(echo -e "$ALL_ALERTS")
|
ALERT_TEXT=$(echo -e "$ALL_ALERTS")
|
||||||
|
|
||||||
# Notify Matrix
|
# Notify Matrix
|
||||||
matrix_send "supervisor" "⚠️ Factory alerts:
|
matrix_send "supervisor" "⚠️ Supervisor alerts:
|
||||||
${ALERT_TEXT}" 2>/dev/null || true
|
${ALERT_TEXT}" 2>/dev/null || true
|
||||||
|
|
||||||
flog "Invoking claude -p for alerts"
|
flog "Invoking claude -p for alerts"
|
||||||
|
|
||||||
CLAUDE_PROMPT="$(cat "$PROMPT_FILE" 2>/dev/null || echo "You are a factory supervisor. Fix the issue below.")
|
CLAUDE_PROMPT="$(cat "$PROMPT_FILE" 2>/dev/null || echo "You are a supervisor agent. Fix the issue below.")
|
||||||
|
|
||||||
## Current Alerts
|
## Current Alerts
|
||||||
${ALERT_TEXT}
|
${ALERT_TEXT}
|
||||||
|
|
@ -2,15 +2,15 @@
|
||||||
# update-prompt.sh — Append a lesson to a best-practices file
|
# update-prompt.sh — Append a lesson to a best-practices file
|
||||||
#
|
#
|
||||||
# Usage:
|
# Usage:
|
||||||
# ./factory/update-prompt.sh "best-practices/memory.md" "### Title\nBody text"
|
# ./supervisor/update-prompt.sh "best-practices/memory.md" "### Title\nBody text"
|
||||||
# ./factory/update-prompt.sh --from-file "best-practices/memory.md" /tmp/lesson.md
|
# ./supervisor/update-prompt.sh --from-file "best-practices/memory.md" /tmp/lesson.md
|
||||||
#
|
#
|
||||||
# Called by claude -p when it learns something during a fix.
|
# Called by claude -p when it learns something during a fix.
|
||||||
# Commits and pushes the update to the disinto repo.
|
# Commits and pushes the update to the disinto repo.
|
||||||
|
|
||||||
source "$(dirname "$0")/../lib/env.sh"
|
source "$(dirname "$0")/../lib/env.sh"
|
||||||
|
|
||||||
TARGET_FILE="${FACTORY_ROOT}/factory/$1"
|
TARGET_FILE="${FACTORY_ROOT}/supervisor/$1"
|
||||||
shift
|
shift
|
||||||
|
|
||||||
if [ "$1" = "--from-file" ] && [ -f "$2" ]; then
|
if [ "$1" = "--from-file" ] && [ -f "$2" ]; then
|
||||||
|
|
@ -40,8 +40,8 @@ else
|
||||||
fi
|
fi
|
||||||
|
|
||||||
cd "$FACTORY_ROOT"
|
cd "$FACTORY_ROOT"
|
||||||
git add "factory/$1" 2>/dev/null || git add "$TARGET_FILE"
|
git add "supervisor/$1" 2>/dev/null || git add "$TARGET_FILE"
|
||||||
git commit -m "factory: learned — $(echo "$LESSON" | head -1 | sed 's/^#* *//')" --no-verify 2>/dev/null
|
git commit -m "supervisor: learned — $(echo "$LESSON" | head -1 | sed 's/^#* *//')" --no-verify 2>/dev/null
|
||||||
git push origin main 2>/dev/null
|
git push origin main 2>/dev/null
|
||||||
|
|
||||||
log "Updated $(basename "$TARGET_FILE") with new lesson"
|
log "Updated $(basename "$TARGET_FILE") with new lesson"
|
||||||
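Reviewer note: the P2b stall check in the hunks above (backlog present, nothing in progress, no agent log for more than 20 minutes) can be restated as a small sketch. The helper names `idle_minutes` and `is_stalled` are hypothetical and are not part of this commit. The sketch assumes the same integer-minute arithmetic and 20-minute threshold shown in the diff.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not in the commit) restating the P2b stall check.

# Minutes elapsed between two epoch timestamps.
# Uses integer division, matching the IDLE_MIN computation in the diff.
idle_minutes() {
  local now_epoch=$1 last_log_epoch=$2
  echo $(( (now_epoch - last_log_epoch) / 60 ))
}

# Exits 0 when the backlog is non-empty, nothing is in progress,
# and the idle time is strictly greater than 20 minutes.
is_stalled() {
  local backlog=${1:-0} in_progress=${2:-0} idle_min=${3:-0}
  [ "$backlog" -gt 0 ] && [ "$in_progress" -eq 0 ] && [ "$idle_min" -gt 20 ]
}
```

A caller could write `is_stalled "$BACKLOG_COUNT" "$IN_PROGRESS" "$IDLE_MIN" && p2 "Pipeline stalled: ..."`, which would keep the threshold in one place if it ever needs tuning.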