refactor: rename factory/ → supervisor/, factory-poll → supervisor-poll

The supervisor agent was confusingly named "factory" (same as the project). Rename directory, script, log, lock, status, and escalation files. Update all references across scripts and docs. FACTORY_ROOT env var unchanged (refers to project root, not agent).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 18:06:25 +01:00 · 2026-03-15 18:06:25 +01:00 · 77cb4c4643
commit 77cb4c4643
parent 8d73c2f8f9
15 changed files with 68 additions and 68 deletions
--- a/supervisor/best-practices/ci.md
+++ b/supervisor/best-practices/ci.md
@@ -0,0 +1,40 @@
+# CI Best Practices
+
+## Environment
+- Woodpecker CI at localhost:8000 (Docker backend)
+- Postgres DB: use `wpdb` helper from env.sh
+- Woodpecker API: use `woodpecker_api` helper from env.sh
+- Example (harb): CI images pre-built at `registry.niovi.voyage/harb/*:latest`
+
+## Safe Fixes
+- Retrigger CI: push empty commit to PR branch
+  ```bash
+  cd /tmp/${PROJECT_NAME}-worktree-<issue> && git commit --allow-empty -m "ci: retrigger" --no-verify && git push origin <branch> --force
+  ```
+- Restart woodpecker-agent: `sudo systemctl restart woodpecker-agent`
+- View pipeline status: `wpdb -c "SELECT number, status FROM pipelines WHERE repo_id=$WOODPECKER_REPO_ID ORDER BY number DESC LIMIT 5;"`
+- View failed steps: `bash ${FACTORY_ROOT}/lib/ci-debug.sh failures <pipeline-number>`
+- View step logs: `bash ${FACTORY_ROOT}/lib/ci-debug.sh logs <pipeline-number> <step-name>`
+
+## Dangerous (escalate)
+- Restarting woodpecker-server (drops all running pipelines)
+- Modifying pipeline configs in `.woodpecker/` directory
+
+## Known Issues
+- Codeberg rate-limits SSH clones. `git` step fails with exit 128. A retrigger usually works, but not during a rate-limit storm (see codeberg.md).
+- `log_entries` table grows fast (was 5.6GB once). Truncate periodically.
+- Example (harb): Running CI + harb stack = 14+ containers on 8GB. Memory pressure is real.
+- CI images take hours to rebuild. Never run `docker system prune -a`.
+
+## Lessons Learned
+- Exit code 128 on git step = Codeberg rate limit, not a code problem. Retrigger.
+- Exit code 137 = OOM kill. Check memory, kill stale processes, retrigger.
+- `node-quality` step fails on eslint/typescript errors — these need code fixes, not CI fixes.
+
+### Example (harb): FEE_DEST address must match DeployLocal.sol
+When DeployLocal.sol changes the feeDest address, bootstrap-common.sh must also be updated.
+Current feeDest = keccak256('harb.local.feeDest') = 0x8A9145E1Ea4C4d7FB08cF1011c8ac1F0e10F9383.
+Symptom: the bootstrap step exits 1 after 'Granting recenter access to deployer' with no error message. setRecenterAccess reverts because the wrong address is impersonated.
+
+### Example (harb): keccak-derived FEE_DEST requires anvil_setBalance before impersonation
+When FEE_DEST is a keccak-derived address (e.g. keccak256('harb.local.feeDest')), it has zero ETH balance. Any function that calls `anvil_impersonateAccount` and then `cast send --from $FEE_DEST --unlocked` exits 1 because gas cannot be deducted, and the error is invisible because output is redirected to LOG_FILE. Fix: add `cast rpc anvil_setBalance "$FEE_DEST" "0xDE0B6B3A7640000"` (1 ETH in wei) before impersonation. Applied in both bootstrap-common.sh and red-team.sh.
--- a/supervisor/best-practices/codeberg.md
+++ b/supervisor/best-practices/codeberg.md
@@ -0,0 +1,36 @@
+# Codeberg Best Practices
+
+## Rate Limiting
+Codeberg rate-limits SSH and HTTPS clones. Symptoms:
+- Woodpecker `git` step fails with exit code 128
+- Multiple pipelines fail in quick succession with the same error
+- Retriggers make it WORSE by adding more clone attempts
+
+### What To Do
+- **Do NOT retrigger** during a rate-limit storm. Wait 10-15 minutes.
+- Check if multiple pipelines failed on `git` step recently:
+  ```bash
+  wpdb -c "SELECT number, status, to_timestamp(started) FROM pipelines WHERE repo_id=$WOODPECKER_REPO_ID AND status='failure' ORDER BY number DESC LIMIT 5;"
+  wpdb -c "SELECT s.name, s.exit_code FROM steps s JOIN pipelines p ON s.pipeline_id=p.id WHERE p.number=<N> AND p.repo_id=$WOODPECKER_REPO_ID AND s.state='failure';"
+  ```
+- If multiple `git` failures with exit 128 in the last 15 min → it's rate limiting. Wait.
+- Only retrigger after 15+ minutes of no CI activity.
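The cooldown rule above can be sketched as a small shell check. This is a minimal sketch, not the supervisor's actual code: `should_retrigger` and `COOLDOWN_SECS` are hypothetical names, and the caller is assumed to supply the epoch time of the most recent CI activity (for example, taken from the `started` column queried above).

```bash
# Hypothetical helper: decide whether a retrigger is safe under the 15-minute rule.
COOLDOWN_SECS=${COOLDOWN_SECS:-900}   # 15 minutes, per the rule above

# should_retrigger <last_activity_epoch> [now_epoch]
# Returns 0 (safe) only if the last CI activity is at least COOLDOWN_SECS old.
should_retrigger() {
  local last=$1
  local now=${2:-$(date +%s)}
  (( now - last >= COOLDOWN_SECS ))
}
```

A caller might wrap the empty-commit push in `if should_retrigger "$last_started"; then ...; fi` and simply skip the push otherwise.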
+
+### How To Retrigger Safely
+```bash
+cd <worktree> && git commit --allow-empty -m "ci: retrigger" --no-verify && git push origin <branch> --force
+```
+
+### Prevention
+- The system runs 3 agents staggered by 3 minutes. During heavy development, many PRs trigger CI simultaneously.
+- One pipeline at a time is ideal on this VPS (resource + rate limit reasons).
+- If >3 pipelines are pending/running, do NOT create more work.
+
+## OAuth Tokens
+- OAuth tokens expire after ~2h. If Codeberg is down during the refresh, a re-login is required.
+- API token is in `~/.netrc` — read via `awk` in env.sh.
+- Review bot has a separate token ($REVIEW_BOT_TOKEN) for formal reviews.
+
+## Lessons Learned
+- Retrigger storm on 2026-03-12: supervisor and dev-agent both retriggered during a rate limit, causing 5+ failed pipelines. Cooldown awareness was added in response.
+- Empty commit retrigger works but adds noise to git history. Acceptable tradeoff.
--- a/supervisor/best-practices/dev-agent.md
+++ b/supervisor/best-practices/dev-agent.md
@@ -0,0 +1,55 @@
+# Dev-Agent Best Practices
+
+## Architecture
+- `dev-poll.sh` (cron */10) → finds ready backlog issues → spawns `dev-agent.sh`
+- `dev-agent.sh` uses `claude -p` for implementation, runs in git worktree
+- Lock file: `/tmp/dev-agent.lock` (contains PID)
+- Status file: `/tmp/dev-agent-status`
+- Worktrees: `/tmp/${PROJECT_NAME}-worktree-<issue-number>/`
+
+## Safe Fixes
+- Remove stale lock: `rm -f /tmp/dev-agent.lock` (only if PID is dead)
+- Kill stuck agent: `kill <pid>` then clean lock
+- Restart on derailed PR: `bash ${FACTORY_ROOT}/dev/dev-agent.sh <issue-number> &`
+- Clean worktree: `cd $PROJECT_REPO_ROOT && git worktree remove /tmp/${PROJECT_NAME}-worktree-<N> --force`
+- Remove `in-progress` label if agent died without cleanup:
+  ```bash
+  codeberg_api DELETE "/issues/<N>/labels/in-progress"
+  ```
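The "only if PID is dead" condition on the stale-lock fix can be sketched as follows. This is a minimal sketch under stated assumptions: `remove_stale_lock` is a hypothetical name, the lock file is assumed to hold the PID on its first line, and `kill -0` is assumed to run as the same user that owns the agent process (otherwise it reports a live process as unreachable).

```bash
# Hypothetical helper: remove a lock file only if the PID it records is dead.
# remove_stale_lock [lock_file]
# Returns 0 if the lock was removed or absent, 1 if its owner is still alive.
remove_stale_lock() {
  local lock=${1:-/tmp/dev-agent.lock}
  [ -f "$lock" ] || return 0                      # nothing to clean
  local pid
  pid=$(head -n 1 "$lock" | tr -cd '0-9')         # keep digits only
  if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
    echo "lock held by live PID $pid, leaving it in place" >&2
    return 1
  fi
  rm -f "$lock"
}
```

Running `remove_stale_lock` with no argument checks `/tmp/dev-agent.lock`; a non-zero return means the agent is still running and the lock should stay.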
+
+## Dangerous (escalate)
+- Restarting agent on an issue that has an open PR with review changes — may lose context
+- Anything that modifies the PR branch history
+- Closing PRs or issues
+
+## Known Issues
+- `claude -p -c` (continue) fails if session was compacted — falls back to fresh `-p`
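+- CI_FIX_COUNT is now reset on CI pass (fixed 2026-03-12), so each review phase gets a fresh CI fix budget
+- Worktree creation fails if the main repo has a stale rebase; dev-agent now auto-heals this
+- Large text passed via jq `--arg` can break; write it to a file first and load it with `--rawfile`
+- `$([ "$VAR" = true ] && echo "...")` crashes under `set -euo pipefail` when the test is false and the result is assigned to a variable, because the assignment inherits the substitution's non-zero exit status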
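The last pitfall can be avoided by making the substitution always exit 0. A minimal sketch, assuming a hypothetical helper `optional_flag` and a hypothetical `--verbose` flag used purely for illustration:

```bash
set -euo pipefail

# Unsafe under `set -e`: when VERBOSE is not "true", the substitution exits 1
# and the bare assignment aborts the script.
#   flag=$([ "$VERBOSE" = true ] && echo "--verbose")

# optional_flag <bool> <flag>: prints <flag> only when <bool> is "true".
# An `if` with no branch taken exits 0, so the assignment below never aborts.
optional_flag() {
  if [ "$1" = true ]; then echo "$2"; fi
}

VERBOSE=false
flag=$(optional_flag "$VERBOSE" "--verbose")
echo "flag='${flag}'"
```

Appending `|| true` inside the substitution also works, but the explicit `if` keeps the intent readable.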
+
+## Lessons Learned
+- Agents don't have memory between tasks — full context must be in the prompt
+- Prior art injection (closed PR diffs) prevents rework
+- Feature issues MUST list affected e2e test files
+- CI fix loop is essential — first attempt rarely works
+- CLAUDE_TIMEOUT=7200 (2h) is needed for complex issues
+
+## Dependency Resolution
+
+**Trust closed state.** If a dependency issue is closed, the code is on the primary branch. Period.
+
+DO NOT try to find the specific PR that closed an issue. This is over-engineering that causes false negatives:
+- Codeberg shares issue/PR numbering — no guaranteed relationship
+- PRs don't always mention the issue number in title/body
+- Searching last N closed PRs misses older merges
+- The dev-agent closes issues after merging, so closed = merged
+
+The only check needed: `issue.state == "closed"`.
+
+### False Positive: Status Unchanged Alert
+The supervisor-poll alert 'status unchanged for Nmin' is a false positive for complex implementation tasks. The status is set to 'claude assessing + implementing' at the START of the `timeout 7200 claude -p ...` call and only updates after Claude finishes. Normal complex tasks (multi-file Solidity changes + forge test) take 45-90 minutes. To distinguish a false positive from a real stuck agent: check that the claude PID is alive (`ps -p <PID>`), consuming CPU (>0%), and has active threads (`pstree -p <PID>`). If the process is alive and using CPU, do NOT restart it — this wastes completed work.
+
+### False Positive: 'Waiting for CI + Review' Alert
+The 'status unchanged for Nmin' alert is also a false positive when status is 'waiting for CI + review on PR #N (round R)'. This is an intentional sleep/poll loop — the agent is waiting for CI to pass and then for review-poll to post a review. CI can take 20–40 minutes; review follows. Do NOT restart the agent. Confirm by checking: (1) agent PID is alive, (2) CI commit status via `codeberg_api GET /commits/<sha>/status`, (3) review-poll log shows it will pick up the PR on next cycle.
--- a/supervisor/best-practices/disk.md
+++ b/supervisor/best-practices/disk.md
@@ -0,0 +1,24 @@
+# Disk Best Practices
+
+## Safe Fixes
+- Docker cleanup: `sudo docker system prune -f` (keeps images, removes stopped containers + dangling layers)
+- Truncate supervisor logs >5MB: `truncate -s 0 <file>`
+- Remove stale worktrees: check `/tmp/${PROJECT_NAME}-worktree-*`, only if dev-agent not running on them
+- Woodpecker log_entries: `DELETE FROM log_entries WHERE id < (SELECT max(id) - 100000 FROM log_entries);` then `VACUUM FULL log_entries;` (plain `VACUUM` does not return the freed space to the OS)
+- Node module caches in worktrees: `rm -rf /tmp/${PROJECT_NAME}-worktree-*/node_modules/`
+- Git garbage collection: `cd $PROJECT_REPO_ROOT && git gc --prune=now`
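The log-truncation fix above can be sketched as a single `find` call. This is a sketch under assumptions: `truncate_big_logs` is a hypothetical name, and log files are assumed to end in `.log` and sit directly in the given directory.

```bash
# Hypothetical helper: empty every *.log file larger than 5 MiB in a directory.
# truncate_big_logs <dir> [size]   (size uses find(1) syntax, default +5M)
truncate_big_logs() {
  local dir=$1
  local size=${2:-+5M}
  # -size +5M matches files strictly larger than 5 MiB; smaller logs are untouched.
  find "$dir" -maxdepth 1 -type f -name '*.log' -size "$size" -exec truncate -s 0 {} +
}
```

Truncating in place (rather than deleting) keeps the inode, so a process that still has the log open continues writing to the same file.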
+
+## Dangerous (escalate)
+- `docker system prune -a --volumes` — deletes ALL images including CI build cache
+- Deleting anything in `$PROJECT_REPO_ROOT/` that's tracked by git
+- Truncating Woodpecker DB tables other than log_entries
+
+## Known Disk Hogs
+- Woodpecker `log_entries` table: grows to 5GB+. Truncate periodically.
+- Docker overlay layers: survive normal prune. `-a` variant kills everything.
+- Git worktrees in /tmp: accumulate node_modules, build artifacts
+- Forge cache in `~/.foundry/cache/`: can grow large with many compilations
+
+## Lessons Learned
+- After truncating log_entries, run VACUUM FULL (reclaims actual disk space)
+- Docker ghost overlay layers need `prune -a` but that kills CI images — only do this if truly desperate
--- a/supervisor/best-practices/git.md
+++ b/supervisor/best-practices/git.md
@@ -0,0 +1,61 @@
+# Git Best Practices
+
+## Environment
+- Repo: `$PROJECT_REPO_ROOT`, remote: `$PROJECT_REMOTE`
+- Branch: `$PRIMARY_BRANCH` (protected — no direct push, PRs only)
+- Worktrees: `/tmp/${PROJECT_NAME}-worktree-<issue>/`
+
+## Safe Fixes
+- Abort stale rebase: `cd $PROJECT_REPO_ROOT && git rebase --abort`
+- Switch to $PRIMARY_BRANCH: `git checkout $PRIMARY_BRANCH`
+- Prune worktrees: `git worktree prune`
+- Reset dirty state: `git checkout -- .` (only uncommitted changes)
+- Fetch latest: `git fetch origin $PRIMARY_BRANCH`
+
+## Auto-fixable by Supervisor
+- **Merge conflict on approved PR**: rebase onto $PRIMARY_BRANCH and force-push
+  ```bash
+  cd /tmp/${PROJECT_NAME}-worktree-<issue> || git worktree add /tmp/${PROJECT_NAME}-worktree-<issue> <branch>
+  cd /tmp/${PROJECT_NAME}-worktree-<issue>
+  git fetch origin $PRIMARY_BRANCH
+  git rebase origin/$PRIMARY_BRANCH
+  # If conflict is trivial (NatSpec, comments): resolve and continue
+  # If conflict is code logic: escalate to Clawy
+  git push origin <branch> --force
+  ```
+- **Stale rebase**: `git rebase --abort && git checkout $PRIMARY_BRANCH`
+- **Wrong branch**: `git checkout $PRIMARY_BRANCH`
+
+## Dangerous (escalate)
+- `git reset --hard` on any branch with unpushed work
+- Deleting remote branches
+- Force-pushing to any branch
+- Anything on the $PRIMARY_BRANCH branch directly
+
+## Known Issues
+- Main repo MUST be on $PRIMARY_BRANCH at all times. Dev work happens in worktrees.
+- Stale rebases (detached HEAD) break all worktree creation — silent pipeline stall.
+- `git worktree add` fails if target directory exists (even empty). Remove first.
+- Many old branches exist locally (100+). Normal — don't bulk-delete.
+
+## Evolution Pipeline
+- The evolution pipeline (`tools/push3-evolution/evolve.sh`) temporarily modifies
+  `onchain/src/OptimizerV3.sol` and `onchain/src/OptimizerV3Push3.sol` during runs.
+- **DO NOT revert these files while evolution is running** (check: `pgrep -f evolve.sh`).
+- If `/tmp/evolution.pid` exists and the PID is alive, the dirty state is intentional.
+- Evolution will restore the files when it finishes.
+
+## Lessons Learned
+- NEVER delete remote branches before confirming merge. Close PR, rebase locally, force-push if needed.
+- Stale rebase caused 5h pipeline stall once (2026-03-11). Auto-heal added to dev-agent.
+- lint-staged hooks fail when `forge` not in PATH. Use `--no-verify` when committing from scripts.
+
+### PR #608 Post-Mortem (2026-03-12/13)
+PR sat blocked for 24 hours while 21 other PRs merged. Root causes:
+1. **Supervisor didn't detect merge conflicts** — only checked CI state, not `mergeable`. Fixed: now checks `mergeable=false` as first condition.
+2. **Supervisor didn't detect stale REQUEST_CHANGES** — review bot requested changes, dev-agent never came back to fix them, moved on to other issues. Need: detect "PR has REQUEST_CHANGES older than N hours with no new push."
+3. **No staleness kill switch** — after N merge conflicts or N days, a PR should be auto-closed and the issue reopened for a fresh attempt. Rebasing across 21 commits is more work than starting over.
+
+**Rules derived:**
+- Supervisor should close PRs that are >24h old with merge conflicts and no recent activity. Reopen the parent issue with a note pointing to the closed PR as prior art.
+- Dev-agent must not abandon a PR with REQUEST_CHANGES — either fix or close it before moving to new work.
--- a/supervisor/best-practices/memory.md
+++ b/supervisor/best-practices/memory.md
@@ -0,0 +1,29 @@
+# Memory Best Practices
+
+## Environment
+- VPS: 8GB RAM, 4GB swap, Debian
+- Running: Docker stack (8 containers), Woodpecker CI, OpenClaw gateway
+
+## Safe Fixes (no permission needed)
+- Kill stale `claude` processes (>3h old): `pgrep -f "claude" --older 10800 | xargs -r kill` (`-r` skips `kill` when nothing matches)
+- Drop filesystem caches: `sync && echo 3 | sudo tee /proc/sys/vm/drop_caches`
+- Restart bloated Anvil: `sudo docker restart ${PROJECT_NAME}-anvil-1` (grows to 12GB+ over hours)
+- Kill orphan node processes from dead worktrees
+
+## Dangerous (escalate)
+- `docker system prune -a --volumes` — kills CI images, hours to rebuild
+- Stopping project stack containers — breaks dev environment
+- OOM that survives all safe fixes — needs human decision on what to kill
+
+## Known Memory Hogs
+- `claude` processes from dev-agent: 200MB+ each, can zombie
+- `dockerd`: 600MB+ baseline (normal)
+- `openclaw-gateway`: 500MB+ (normal)
+- Anvil container: starts small, grows unbounded over hours
+- `forge build` with via_ir: can spike to 4GB+. Use `--skip test script` to reduce.
+- Vite dev servers inside containers: 150MB+ each
+
+## Lessons Learned
+- After killing processes, always `sync && echo 3 | sudo tee /proc/sys/vm/drop_caches`
+- Swap doesn't drain from dropping caches alone — it's actual paged-out process memory
+- Running CI + full project stack = 14+ containers on 8GB. Only one pipeline at a time.
--- a/supervisor/best-practices/review-agent.md
+++ b/supervisor/best-practices/review-agent.md
@@ -0,0 +1,30 @@
+# Review Agent Best Practices
+
+## Architecture
+- `review-poll.sh` (cron */10) → finds open PRs with CI pass + no review → spawns `review-pr.sh`
+- `review-pr.sh` uses `claude -p` to review the diff, posts structured comment
+- Uses `review_bot` Codeberg account for formal reviews (separate from main account)
+- Skips WIP/draft PRs (`[WIP]` in title or draft flag)
+
+## Safe Fixes
+- Manually trigger review: `bash ${FACTORY_ROOT}/review/review-pr.sh <pr-number>`
+- Force re-review: `bash ${FACTORY_ROOT}/review/review-pr.sh <pr-number> --force`
+- Check review log: `tail -20 ${FACTORY_ROOT}/review/review.log`
+
+## Common Failures
+- **"SKIP: CI=failure"** — review bot won't review until CI passes. Fix CI first.
+- **"already reviewed"** — bot checks `<!-- reviewed: SHA -->` comment marker. Use `--force` to override.
+- **Review error comment** — uses `<!-- review-error: SHA -->` marker, does NOT count as reviewed. Bot should retry automatically.
+- **Self-narration collapse** — bot sometimes narrates instead of producing structured JSON. JSON output format in the prompt prevents this.
+- **Hallucinated findings** — bot may flag non-issues. This needs Clawy's judgment — escalate.
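The marker convention above (a `reviewed` marker counts, a `review-error` marker does not) can be sketched as a fixed-string match. This is a minimal sketch, not the bot's actual implementation: `is_reviewed` is a hypothetical name, and the PR's comment bodies are assumed to arrive on stdin as plain text.

```bash
# Hypothetical helper: has this commit already been reviewed?
# is_reviewed <sha>   (reads comment bodies on stdin)
# Returns 0 only if the "reviewed" marker for that SHA is present.
# A "review-error" marker for the same SHA does not match, so the bot retries.
is_reviewed() {
  grep -qF -- "<!-- reviewed: $1 -->"
}
```

Matching with `-F` treats the marker as a literal string, so characters in the SHA or the comment syntax are never interpreted as a regular expression.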
+
+## Monitoring
+- Unreviewed PRs with CI pass for >1h → supervisor-poll.sh auto-triggers review
+- Review errors should resolve on next poll cycle
+- If same PR fails review 3+ times → likely a prompt issue, escalate
+
+## Lessons Learned
+- Review bot must output JSON — prevents self-narration collapse
+- DISCUSS verdict should be treated the same as REQUEST_CHANGES by dev-agent
+- Error comments must NOT include `<!-- reviewed: SHA -->`, which would falsely mark the PR as reviewed
+- Review bot uses the Codeberg formal reviews API, because branch protection requires a different user than the PR author