fix: {project}-ops repo — separate operations from code (#757) (#767)

Fixes #757

## Changes

Separate operations from code into {project}-ops repo pattern. Added OPS_REPO_ROOT infrastructure (env.sh, load-project.sh, formula-session.sh with ensure_ops_repo helper). Updated all 8 agent scripts and 7 formulas to read/write vault items, journals, evidence, prerequisites, RESOURCES.md, and knowledge from the ops repo. Added setup_ops_repo() to disinto init for automatic ops repo creation and seeding. Removed migrated data from code repo (vault data dirs, planner journal/memory/prerequisites, supervisor journal/best-practices, evidence, RESOURCES.md). Updated all documentation.

55 files changed, ShellCheck clean, all 38 phase tests pass.

Co-authored-by: openhands <openhands@all-hands.dev>
Reviewed-on: https://codeberg.org/johba/disinto/pulls/767
Reviewed-by: Disinto_bot <disinto_bot@noreply.codeberg.org>
2026-03-26 19:55:12 +01:00
commit 71fe89cdd0
parent a899fd0733
55 changed files with 421 additions and 932 deletions
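
The commit message names an `ensure_ops_repo` helper, but its body is not part of the hunks shown here. As a rough illustration only, the sketch below creates the three ops-repo directories that the hunks below reference (`vault/pending`, `journal/supervisor`, `knowledge`). The function name `ops_layout_check` and the temporary-directory fallback are assumptions made for the example, not the project's code.

```bash
#!/usr/bin/env bash
set -eu

# Hypothetical helper, NOT the real ensure_ops_repo from formula-session.sh.
# It only creates the directories that the hunks in this commit refer to.
ops_layout_check() {
  local root="${1:?usage: ops_layout_check <ops-repo-root>}"
  local dir
  for dir in vault/pending journal/supervisor knowledge; do
    mkdir -p "${root}/${dir}"
  done
}

# Assumption for the demo: fall back to a throwaway directory when
# OPS_REPO_ROOT is unset. The real default comes from env.sh or
# load-project.sh and is not shown in this diff.
OPS_REPO_ROOT="${OPS_REPO_ROOT:-$(mktemp -d)}"
ops_layout_check "$OPS_REPO_ROOT"
echo "ops repo layout ready at ${OPS_REPO_ROOT}"
```

Because `mkdir -p` is idempotent, a check like this could run at the start of every agent session without a separate existence test.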
--- a/supervisor/AGENTS.md
+++ b/supervisor/AGENTS.md
@@ -31,10 +31,9 @@ runs directly from cron like the planner and predictor.
 - `formulas/run-supervisor.toml` — Execution spec: five steps (preflight review,
  health-assessment, decide-actions, report, journal) with `needs` dependencies.
  Claude evaluates all metrics and takes actions in a single interactive session
-- `supervisor/journal/*.md` — Daily health logs from each supervisor run (local,
-  committed periodically)
+- `$OPS_REPO_ROOT/journal/supervisor/*.md` — Daily health logs from each supervisor run
 - `supervisor/PROMPT.md` — Best-practices reference for remediation actions
-- `supervisor/best-practices/*.md` — Domain-specific remediation guides (memory,
+- `$OPS_REPO_ROOT/knowledge/*.md` — Domain-specific remediation guides (memory,
  disk, CI, git, dev-agent, review-agent, forge)
 - `supervisor/supervisor-poll.sh` — Legacy bash orchestrator (superseded by
  supervisor-run.sh + formula)
@@ -43,7 +42,7 @@ runs directly from cron like the planner and predictor.
 P3 (degraded PRs, circular deps, stale deps), P4 (housekeeping).

 **Environment variables consumed**:
-- `FORGE_TOKEN`, `FORGE_SUPERVISOR_TOKEN` (falls back to FORGE_TOKEN), `FORGE_REPO`, `FORGE_API`, `PROJECT_NAME`, `PROJECT_REPO_ROOT`
+- `FORGE_TOKEN`, `FORGE_SUPERVISOR_TOKEN` (falls back to FORGE_TOKEN), `FORGE_REPO`, `FORGE_API`, `PROJECT_NAME`, `PROJECT_REPO_ROOT`, `OPS_REPO_ROOT`
 - `PRIMARY_BRANCH`, `CLAUDE_MODEL` (set to sonnet by supervisor-run.sh)
 - `WOODPECKER_TOKEN`, `WOODPECKER_SERVER`, `WOODPECKER_DB_PASSWORD`, `WOODPECKER_DB_USER`, `WOODPECKER_DB_HOST`, `WOODPECKER_DB_NAME` — CI database queries

--- a/supervisor/PROMPT.md
+++ b/supervisor/PROMPT.md
@@ -15,14 +15,14 @@ You are the supervisor agent for `$FORGE_REPO`. You were called because

 Fix the issue yourself. You have full shell access and `--dangerously-skip-permissions`.

-Before acting, read the relevant best-practices file:
-- Memory issues → `cat ${FACTORY_ROOT}/supervisor/best-practices/memory.md`
-- Disk issues → `cat ${FACTORY_ROOT}/supervisor/best-practices/disk.md`
-- CI issues → `cat ${FACTORY_ROOT}/supervisor/best-practices/ci.md`
-- forge / rate limits → `cat ${FACTORY_ROOT}/supervisor/best-practices/forge.md`
-- Dev-agent issues → `cat ${FACTORY_ROOT}/supervisor/best-practices/dev-agent.md`
-- Review-agent issues → `cat ${FACTORY_ROOT}/supervisor/best-practices/review-agent.md`
-- Git issues → `cat ${FACTORY_ROOT}/supervisor/best-practices/git.md`
+Before acting, read the relevant knowledge file from the ops repo:
+- Memory issues → `cat ${OPS_REPO_ROOT}/knowledge/memory.md`
+- Disk issues → `cat ${OPS_REPO_ROOT}/knowledge/disk.md`
+- CI issues → `cat ${OPS_REPO_ROOT}/knowledge/ci.md`
+- forge / rate limits → `cat ${OPS_REPO_ROOT}/knowledge/forge.md`
+- Dev-agent issues → `cat ${OPS_REPO_ROOT}/knowledge/dev-agent.md`
+- Review-agent issues → `cat ${OPS_REPO_ROOT}/knowledge/review-agent.md`
+- Git issues → `cat ${OPS_REPO_ROOT}/knowledge/git.md`

 ## Credentials & API Access

@@ -83,7 +83,7 @@ When you see "Dev-agent blocked: last N polls all report 'no ready issues'":

 File a vault procurement item so the human is notified through the vault:
 ```bash
-cat > "${PROJECT_REPO_ROOT}/vault/pending/supervisor-$(date -u +%Y%m%d-%H%M)-issue.md" <<'VAULT_EOF'
+cat > "${OPS_REPO_ROOT}/vault/pending/supervisor-$(date -u +%Y%m%d-%H%M)-issue.md" <<'VAULT_EOF'
 # <What is needed>
 ## What
 <description of the problem and why the supervisor cannot fix it>
@@ -106,13 +106,13 @@ FIXED: <what you did>
 ```
 or
 ```
-VAULT: filed vault/pending/<id>.md — <what's needed>
+VAULT: filed $OPS_REPO_ROOT/vault/pending/<id>.md — <what's needed>
 ```

 ## Learning

-If you discover something new, append it to the relevant best-practices file:
+If you discover something new, append it to the relevant knowledge file in the ops repo:
 ```bash
-bash ${FACTORY_ROOT}/supervisor/update-prompt.sh "best-practices/<file>.md" "### Lesson title
-Description of what you learned."
+echo "### Lesson title
+Description of what you learned." >> "${OPS_REPO_ROOT}/knowledge/<file>.md"
 ```
--- a/supervisor/best-practices/ci.md
+++ b/supervisor/best-practices/ci.md
@@ -1,45 +0,0 @@
-# CI Best Practices
-
-## Environment
- Woodpecker CI at localhost:8000 (Docker backend)
- Postgres DB: use `wpdb` helper from env.sh
- Woodpecker API: use `woodpecker_api` helper from env.sh
- Example (harb): CI images pre-built at `registry.niovi.voyage/harb/*:latest`
-
-## Safe Fixes
- Retrigger CI (preferred, automated): Woodpecker API POST
-  ```bash
-  woodpecker_api "/repos/${WOODPECKER_REPO_ID}/pipelines/${PIPELINE_NUMBER}" -X POST
-  ```
-  supervisor-poll.sh does this automatically for infra failures (max 2 retries).
- Retrigger CI (manual fallback): push empty commit to PR branch
-  ```bash
-  cd /tmp/${PROJECT_NAME}-worktree-<issue> && git commit --allow-empty -m "ci: retrigger" --no-verify && git push origin <branch> --force
-  ```
- Restart woodpecker-agent: `sudo systemctl restart woodpecker-agent`
- View pipeline status: `wpdb -c "SELECT number, status FROM pipelines WHERE repo_id=$WOODPECKER_REPO_ID ORDER BY number DESC LIMIT 5;"`
- View failed steps: `bash ${FACTORY_ROOT}/lib/ci-debug.sh failures <pipeline-number>`
- View step logs: `bash ${FACTORY_ROOT}/lib/ci-debug.sh logs <pipeline-number> <step-name>`
-
-## Dangerous (escalate)
- Restarting woodpecker-server (drops all running pipelines)
- Modifying pipeline configs in `.woodpecker/` directory
-
-## Known Issues
- forge rate-limits SSH clones. `git` step fails with exit 128. Retrigger usually works.
- `log_entries` table grows fast (was 5.6GB once). Truncate periodically.
- Example (harb): Running CI + harb stack = 14+ containers on 8GB. Memory pressure is real.
- CI images take hours to rebuild. Never run `docker system prune -a`.
-
-## Lessons Learned
- Exit code 128 on git step = forge rate limit, not a code problem. Retrigger.
- Exit code 137 = OOM kill. Check memory, kill stale processes, retrigger.
- `node-quality` step fails on eslint/typescript errors — these need code fixes, not CI fixes.
-
-### Example (harb): FEE_DEST address must match DeployLocal.sol
-When DeployLocal.sol changes the feeDest address, bootstrap-common.sh must also be updated.
-Current feeDest = keccak256('harb.local.feeDest') = 0x8A9145E1Ea4C4d7FB08cF1011c8ac1F0e10F9383.
-Symptom: bootstrap step exits 1 after 'Granting recenter access to deployer' with no error — setRecenterAccess reverts because wrong address is impersonated.
-
-### Example (harb): keccak-derived FEE_DEST requires anvil_setBalance before impersonation
-When FEE_DEST is a keccak-derived address (e.g. keccak256('harb.local.feeDest')), it has zero ETH balance. Any function that calls `anvil_impersonateAccount` then `cast send --from $FEE_DEST --unlocked` will fail silently (output redirected to LOG_FILE) but exit 1 due to gas deduction failure. Fix: add `cast rpc anvil_setBalance "$FEE_DEST" "0xDE0B6B3A7640000"` before impersonation. Applied in both bootstrap-common.sh and red-team.sh.
--- a/supervisor/best-practices/dev-agent.md
+++ b/supervisor/best-practices/dev-agent.md
@@ -1,93 +0,0 @@
-# Dev-Agent Best Practices
-
-## Architecture
- `dev-poll.sh` (cron */10) → finds ready backlog issues → spawns `dev-agent.sh`
- `dev-agent.sh` uses `claude -p` for implementation, runs in git worktree
- Lock file: `/tmp/dev-agent.lock` (contains PID)
- Status file: `/tmp/dev-agent-status`
- Worktrees: `/tmp/${PROJECT_NAME}-worktree-<issue-number>/`
-
-## Safe Fixes
- Remove stale lock: `rm -f /tmp/dev-agent.lock` (only if PID is dead)
- Kill stuck agent: `kill <pid>` then clean lock
- Restart on derailed PR: `bash ${FACTORY_ROOT}/dev/dev-agent.sh <issue-number> &`
- Clean worktree: `cd $PROJECT_REPO_ROOT && git worktree remove /tmp/${PROJECT_NAME}-worktree-<N> --force`
- Remove `in-progress` label if agent died without cleanup:
-  ```bash
-  forge_api DELETE "/issues/<N>/labels/in-progress"
-  ```
-
-## Dangerous (escalate)
- Restarting agent on an issue that has an open PR with review changes — may lose context
- Anything that modifies the PR branch history
- Closing PRs or issues
-
-## Known Issues
- `claude -p -c` (continue) fails if session was compacted — falls back to fresh `-p`
- CI_FIX_COUNT is now reset on CI pass (fixed 2026-03-12), so each review phase gets fresh CI fix budget
- Worktree creation fails if main repo has stale rebase — auto-heals now
- Large text in jq `--arg` can break — write to file first
- `$([ "$VAR" = true ] && echo "...")` crashes under `set -euo pipefail`
-
-## Lessons Learned
- Agents don't have memory between tasks — full context must be in the prompt
- Prior art injection (closed PR diffs) prevents rework
- Feature issues MUST list affected e2e test files
- CI fix loop is essential — first attempt rarely works
- CLAUDE_TIMEOUT=7200 (2h) is needed for complex issues
-
-## Dependency Resolution
-
-**Trust closed state.** If a dependency issue is closed, the code is on the primary branch. Period.
-
-DO NOT try to find the specific PR that closed an issue. This is over-engineering that causes false negatives:
- forge shares issue/PR numbering — no guaranteed relationship
- PRs don't always mention the issue number in title/body
- Searching last N closed PRs misses older merges
- The dev-agent closes issues after merging, so closed = merged
-
-The only check needed: `issue.state == "closed"`.
-
-### False Positive: Status Unchanged Alert
-The supervisor-poll alert 'status unchanged for Nmin' is a false positive for complex implementation tasks. The status is set to 'claude assessing + implementing' at the START of the `timeout 7200 claude -p ...` call and only updates after Claude finishes. Normal complex tasks (multi-file Solidity changes + forge test) take 45-90 minutes. To distinguish a false positive from a real stuck agent: check that the claude PID is alive (`ps -p <PID>`), consuming CPU (>0%), and has active threads (`pstree -p <PID>`). If the process is alive and using CPU, do NOT restart it — this wastes completed work.
-
-### False Positive: 'Waiting for CI + Review' Alert
-The 'status unchanged for Nmin' alert is also a false positive when status is 'waiting for CI + review on PR #N (round R)'. This is an intentional sleep/poll loop — the agent is waiting for CI to pass and then for review-poll to post a review. CI can take 20–40 minutes; review follows. Do NOT restart the agent. Confirm by checking: (1) agent PID is alive, (2) CI commit status via `forge_api GET /commits/<sha>/status`, (3) review-poll log shows it will pick up the PR on next cycle.
-
-### False Positive: Shared Status File Causes Giant Age (29M+ min)
-When the status file `/tmp/dev-agent-status` doesn't exist, `stat -c %Y` fails and the supervisor falls back to epoch 0. The computed age is then `NOW_EPOCH/60 ≈ 29,567,290 min`, which is unmistakably a false positive.
-Root cause: the status file is not per-project (tracked as disinto issue #423). It can be missing if: (1) the agent has not written to it yet, (2) cleanup ran early, or (3) another project's cleanup deleted it.
-Fix: confirm the agent PID is alive and the tmux session shows active work, then touch the file: `printf '[%s] dev-agent #NNN: <phase> (<project>)\n' "$(date -u '+%Y-%m-%d %H:%M:%S UTC')" > /tmp/dev-agent-status`. This clears the alert without restarting anything.
-
-### PR CI vs Push CI mismatch causes silent stall in awaiting_review
-When push CI passes but PR CI fails (e.g., a duplicate-detection step only runs on pull_request events), the phase-handler transitions to PHASE:awaiting_review without detecting the PR CI failure. The agent then sleeps in the review-poll loop indefinitely.
-
-Symptom: PR CI=failure but dev-agent phase=awaiting_review, status shows 'waiting for CI + review'.
-
-Fix: inject the CI failure info into the Claude session with agent_inject_into_session, pointing to the duplicate blocks and telling Claude to fix + push + write PHASE:awaiting_ci. The phase-handler's awaiting_review loop checks for phase file mtime changes every 5 min and will re-enter the main loop automatically.
-
-### Push CI vs PR CI mismatch — agent picks wrong pipeline number
-When the phase-handler injects 'CI failed' with a push pipeline number (e.g. #622), the agent checks that push pipeline, finds it passed, and concludes 'CI OK' — setting PHASE:awaiting_review despite the PR pipeline (#623) being the one that actually failed.
-Root cause: the injected event does not always carry the correct pipeline number.
-Symptom: agent in awaiting_review with PR CI=failure and push CI=success.
-Fix: inject with explicit pipeline #623 (the pull_request event pipeline), point to the failing step and the specific duplicate blocks to fix. Use: woodpecker_api /repos/4/pipelines?event=pull_request (or look for event=pull_request in recent pipelines list) to find the correct pipeline number before injecting.
-
-### Race Condition: Review Posted Before PHASE:awaiting_review Transitions
-**Symptom:** Dev-agent status unchanged at 'waiting for review on PR #N', no `review-injected-disinto-N` sentinel, but a formal review already exists on forge and `/tmp/disinto-review-output-N.json` was written before the phase file updated.
-
-**Root cause:** review-pr.sh runs while the dev-agent is still in PHASE:awaiting_ci. inject_review_into_dev_session returns early (phase check fails). On subsequent review-poll cycles, the PR is skipped (formal review already exists for SHA), so inject is never called again.
-
-**Fix:** Manually inject the review:
-```bash
-source /home/debian/dark-factory/lib/env.sh
-PROJECT_TOML=/home/debian/dark-factory/projects/disinto.toml
-source /home/debian/dark-factory/lib/load-project.sh "$PROJECT_TOML"
-PHASE_FILE="/tmp/dev-session-${PROJECT_NAME}-<ISSUE>.phase"
-PR_NUM=<N>; PR_BRANCH="fix/issue-<ISSUE>"; PR_SHA=$(cat /tmp/dev-session-${PROJECT_NAME}-<ISSUE>.phase | grep SHA | cut -d: -f2 || git -C $PROJECT_REPO_ROOT rev-parse origin/$PR_BRANCH)
-REVIEW_TEXT=$(curl -sf -H "Authorization: token ${FORGE_TOKEN}" "${FORGE_API}/issues/${PR_NUM}/comments?limit=50" | jq -r --arg sha "$PR_SHA" '[.[] | select(.body | contains("<!-- reviewed: " + $sha))] | last // empty | .body')
-INJECT_MSG="Review: REQUEST_CHANGES on PR #${PR_NUM}:\n\n${REVIEW_TEXT}\n\nInstructions:\n1. Address each piece of feedback carefully.\n2. Run lint and tests when done.\n3. Commit your changes and push: git push origin ${PR_BRANCH}\n4. Write: echo PHASE:awaiting_ci > "${PHASE_FILE}"\n5. Stop and wait for the next CI result."
-INJECT_TMP=$(mktemp); printf '%s' "$INJECT_MSG" > "$INJECT_TMP"
-tmux load-buffer -b inject "$INJECT_TMP" && tmux paste-buffer -t "dev-${PROJECT_NAME}-<ISSUE>" -b inject && sleep 0.5 && tmux send-keys -t "dev-${PROJECT_NAME}-<ISSUE>" '' Enter
-touch "/tmp/review-injected-${PROJECT_NAME}-${PR_NUM}"
-```
-Then update /tmp/dev-agent-status to reflect current work.
--- a/supervisor/best-practices/disk.md
+++ b/supervisor/best-practices/disk.md
@@ -1,24 +0,0 @@
-# Disk Best Practices
-
-## Safe Fixes
- Docker cleanup: `sudo docker system prune -f` (keeps images, removes stopped containers + dangling layers)
- Truncate supervisor logs >5MB: `truncate -s 0 <file>`
- Remove stale worktrees: check `/tmp/${PROJECT_NAME}-worktree-*`, only if dev-agent not running on them
- Woodpecker log_entries: `DELETE FROM log_entries WHERE id < (SELECT max(id) - 100000 FROM log_entries);` then `VACUUM;`
- Node module caches in worktrees: `rm -rf /tmp/${PROJECT_NAME}-worktree-*/node_modules/`
- Git garbage collection: `cd $PROJECT_REPO_ROOT && git gc --prune=now`
-
-## Dangerous (escalate)
- `docker system prune -a --volumes` — deletes ALL images including CI build cache
- Deleting anything in `$PROJECT_REPO_ROOT/` that's tracked by git
- Truncating Woodpecker DB tables other than log_entries
-
-## Known Disk Hogs
- Woodpecker `log_entries` table: grows to 5GB+. Truncate periodically.
- Docker overlay layers: survive normal prune. `-a` variant kills everything.
- Git worktrees in /tmp: accumulate node_modules, build artifacts
- Forge cache in `~/.foundry/cache/`: can grow large with many compilations
-
-## Lessons Learned
- After truncating log_entries, run VACUUM FULL (reclaims actual disk space)
- Docker ghost overlay layers need `prune -a` but that kills CI images — only do this if truly desperate
--- a/supervisor/best-practices/forge.md
+++ b/supervisor/best-practices/forge.md
@@ -1,36 +0,0 @@
-# Forge Best Practices
-
-## Rate Limiting
-The forge (Forgejo/Gitea) may rate-limit SSH and HTTPS clones. Symptoms:
- Woodpecker `git` step fails with exit code 128
- Multiple pipelines fail in quick succession with the same error
- Retriggers make it WORSE by adding more clone attempts
-
-### What To Do
- **Do NOT retrigger** during a rate-limit storm. Wait 10-15 minutes.
- Check if multiple pipelines failed on `git` step recently:
-  ```bash
-  wpdb -c "SELECT number, status, to_timestamp(started) FROM pipelines WHERE repo_id=$WOODPECKER_REPO_ID AND status='failure' ORDER BY number DESC LIMIT 5;"
-  wpdb -c "SELECT s.name, s.exit_code FROM steps s JOIN pipelines p ON s.pipeline_id=p.id WHERE p.number=<N> AND p.repo_id=$WOODPECKER_REPO_ID AND s.state='failure';"
-  ```
- If multiple `git` failures with exit 128 in the last 15 min → it's rate limiting. Wait.
- Only retrigger after 15+ minutes of no CI activity.
-
-### How To Retrigger Safely
-```bash
-cd <worktree> && git commit --allow-empty -m "ci: retrigger" --no-verify && git push origin <branch> --force
-```
-
-### Prevention
- The system runs 3 agents staggered by 3 minutes. During heavy development, many PRs trigger CI simultaneously.
- One pipeline at a time is ideal on this VPS (resource + rate limit reasons).
- If >3 pipelines are pending/running, do NOT create more work.
-
-## API Tokens
- API token is in `.env` as `FORGE_TOKEN` — loaded via env.sh.
- Review bot has a separate token (`$FORGE_REVIEW_TOKEN`) for formal reviews.
- With local Forgejo, tokens don't expire. For remote forges, check provider docs.
-
-## Lessons Learned
- Retrigger storm on 2026-03-12: supervisor + dev-agent both retriggered during rate limit, caused 5+ failed pipelines. Added cooldown awareness.
- Empty commit retrigger works but adds noise to git history. Acceptable tradeoff.
--- a/supervisor/best-practices/git.md
+++ b/supervisor/best-practices/git.md
@@ -1,61 +0,0 @@
-# Git Best Practices
-
-## Environment
- Repo: `$PROJECT_REPO_ROOT`, remote: `$PROJECT_REMOTE`
- Branch: `$PRIMARY_BRANCH` (protected — no direct push, PRs only)
- Worktrees: `/tmp/${PROJECT_NAME}-worktree-<issue>/`
-
-## Safe Fixes
- Abort stale rebase: `cd $PROJECT_REPO_ROOT && git rebase --abort`
- Switch to $PRIMARY_BRANCH: `git checkout $PRIMARY_BRANCH`
- Prune worktrees: `git worktree prune`
- Reset dirty state: `git checkout -- .` (only uncommitted changes)
- Fetch latest: `git fetch origin $PRIMARY_BRANCH`
-
-## Auto-fixable by Supervisor
- **Merge conflict on approved PR**: rebase onto $PRIMARY_BRANCH and force-push
-  ```bash
-  cd /tmp/${PROJECT_NAME}-worktree-<issue> || git worktree add /tmp/${PROJECT_NAME}-worktree-<issue> <branch>
-  cd /tmp/${PROJECT_NAME}-worktree-<issue>
-  git fetch origin $PRIMARY_BRANCH
-  git rebase origin/$PRIMARY_BRANCH
-  # If conflict is trivial (NatSpec, comments): resolve and continue
-  # If conflict is code logic: escalate to Clawy
-  git push origin <branch> --force
-  ```
- **Stale rebase**: `git rebase --abort && git checkout $PRIMARY_BRANCH`
- **Wrong branch**: `git checkout $PRIMARY_BRANCH`
-
-## Dangerous (escalate)
- `git reset --hard` on any branch with unpushed work
- Deleting remote branches
- Force-pushing to any branch
- Anything on the $PRIMARY_BRANCH branch directly
-
-## Known Issues
- Main repo MUST be on $PRIMARY_BRANCH at all times. Dev work happens in worktrees.
- Stale rebases (detached HEAD) break all worktree creation — silent pipeline stall.
- `git worktree add` fails if target directory exists (even empty). Remove first.
- Many old branches exist locally (100+). Normal — don't bulk-delete.
-
-## Evolution Pipeline
- The evolution pipeline (`tools/push3-evolution/evolve.sh`) temporarily modifies
-  `onchain/src/OptimizerV3.sol` and `onchain/src/OptimizerV3Push3.sol` during runs.
- **DO NOT revert these files while evolution is running** (check: `pgrep -f evolve.sh`).
- If `/tmp/evolution.pid` exists and the PID is alive, the dirty state is intentional.
- Evolution will restore the files when it finishes.
-
-## Lessons Learned
- NEVER delete remote branches before confirming merge. Close PR, rebase locally, force-push if needed.
- Stale rebase caused 5h pipeline stall once (2026-03-11). Auto-heal added to dev-agent.
- lint-staged hooks fail when `forge` not in PATH. Use `--no-verify` when committing from scripts.
-
-### PR #608 Post-Mortem (2026-03-12/13)
-PR sat blocked for 24 hours while 21 other PRs merged. Root causes:
-1. **Supervisor didn't detect merge conflicts** — only checked CI state, not `mergeable`. Fixed: now checks `mergeable=false` as first condition.
-2. **Supervisor didn't detect stale REQUEST_CHANGES** — review bot requested changes, dev-agent never came back to fix them, moved on to other issues. Need: detect "PR has REQUEST_CHANGES older than N hours with no new push."
-3. **No staleness kill switch** — after N merge conflicts or N days, a PR should be auto-closed and the issue reopened for a fresh attempt. Rebasing across 21 commits is more work than starting over.
-
-**Rules derived:**
- Supervisor should close PRs that are >24h old with merge conflicts and no recent activity. Reopen the parent issue with a note pointing to the closed PR as prior art.
- Dev-agent must not abandon a PR with REQUEST_CHANGES — either fix or close it before moving to new work.
--- a/supervisor/best-practices/memory.md
+++ b/supervisor/best-practices/memory.md
@@ -1,29 +0,0 @@
-# Memory Best Practices
-
-## Environment
- VPS: 8GB RAM, 4GB swap, Debian
- Running: Docker stack (8 containers), Woodpecker CI, OpenClaw gateway
-
-## Safe Fixes (no permission needed)
- Kill stale `claude` processes (>3h old): `pgrep -f "claude" --older 10800 | xargs kill`
- Drop filesystem caches: `sync && echo 3 | sudo tee /proc/sys/vm/drop_caches`
- Restart bloated Anvil: `sudo docker restart ${PROJECT_NAME}-anvil-1` (grows to 12GB+ over hours)
- Kill orphan node processes from dead worktrees
-
-## Dangerous (escalate)
- `docker system prune -a --volumes` — kills CI images, hours to rebuild
- Stopping project stack containers — breaks dev environment
- OOM that survives all safe fixes — needs human decision on what to kill
-
-## Known Memory Hogs
- `claude` processes from dev-agent: 200MB+ each, can zombie
- `dockerd`: 600MB+ baseline (normal)
- `openclaw-gateway`: 500MB+ (normal)
- Anvil container: starts small, grows unbounded over hours
- `forge build` with via_ir: can spike to 4GB+. Use `--skip test script` to reduce.
- Vite dev servers inside containers: 150MB+ each
-
-## Lessons Learned
- After killing processes, always `sync && echo 3 | sudo tee /proc/sys/vm/drop_caches`
- Swap doesn't drain from dropping caches alone — it's actual paged-out process memory
- Running CI + full project stack = 14+ containers on 8GB. Only one pipeline at a time.
--- a/supervisor/best-practices/review-agent.md
+++ b/supervisor/best-practices/review-agent.md
@@ -1,30 +0,0 @@
-# Review Agent Best Practices
-
-## Architecture
- `review-poll.sh` (cron */10) → finds open PRs with CI pass + no review → spawns `review-pr.sh`
- `review-pr.sh` uses `claude -p` to review the diff, posts structured comment
- Uses `review_bot` forge account for formal reviews (separate from main account)
- Skips WIP/draft PRs (`[WIP]` in title or draft flag)
-
-## Safe Fixes
- Manually trigger review: `bash ${FACTORY_ROOT}/review/review-pr.sh <pr-number>`
- Force re-review: `bash ${FACTORY_ROOT}/review/review-pr.sh <pr-number> --force`
- Check review log: `tail -20 ${FACTORY_ROOT}/review/review.log`
-
-## Common Failures
- **"SKIP: CI=failure"** — review bot won't review until CI passes. Fix CI first.
- **"already reviewed"** — bot checks `<!-- reviewed: SHA -->` comment marker. Use `--force` to override.
- **Review error comment** — uses `<!-- review-error: SHA -->` marker, does NOT count as reviewed. Bot should retry automatically.
- **Self-narration collapse** — bot sometimes narrates instead of producing structured JSON. JSON output format in the prompt prevents this.
- **Hallucinated findings** — bot may flag non-issues. This needs Clawy's judgment — escalate.
-
-## Monitoring
- Unreviewed PRs with CI pass for >1h → supervisor-poll.sh auto-triggers review
- Review errors should resolve on next poll cycle
- If same PR fails review 3+ times → likely a prompt issue, escalate
-
-## Lessons Learned
- Review bot must output JSON — prevents self-narration collapse
- DISCUSS verdict should be treated same as REQUEST_CHANGES by dev-agent
- Error comments must NOT include `<!-- reviewed: SHA -->` — would falsely mark as reviewed
- Review bot uses forge formal reviews API — branch protection requires different user than PR author
--- a/supervisor/journal/.gitkeep
+++ b/supervisor/journal/.gitkeep
--- a/supervisor/preflight.sh
+++ b/supervisor/preflight.sh
@@ -218,7 +218,7 @@ echo ""

 echo "## Pending Vault Items"
 _found_vault=false
-for _vf in "${PROJECT_REPO_ROOT}/vault/pending/"*.md; do
+for _vf in "${OPS_REPO_ROOT}/vault/pending/"*.md; do
  [ -f "$_vf" ] || continue
  _found_vault=true
  _vtitle=$(grep -m1 '^# ' "$_vf" | sed 's/^# //' || basename "$_vf")
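
The hunk above shows only the first lines of the pending-vault loop in `preflight.sh`. As a self-contained illustration of the same pattern (iterate over `vault/pending/*.md` under the ops repo, skip an unmatched glob, take the first `# ` heading as the title), here is a minimal sketch. The function name `list_pending_vault`, the `(none)` fallback text, and the demo file are assumptions for the example; the remainder of the real loop is not visible in this diff.

```bash
#!/usr/bin/env bash
set -eu

# Hypothetical sketch of the pending-vault listing; not the preflight.sh code.
list_pending_vault() {
  local root="$1" found=false vf title
  for vf in "${root}/vault/pending/"*.md; do
    # An unmatched glob stays a literal pattern, so skip non-files.
    [ -f "$vf" ] || continue
    found=true
    # Print the first "# " heading without its marker, then stop reading.
    title=$(sed -n '/^# /{s/^# //p;q;}' "$vf")
    # Fall back to the file name when the item has no heading.
    [ -n "$title" ] || title=$(basename "$vf")
    printf '%s\n' "- ${title}"
  done
  [ "$found" = true ] || echo "(none)"
}

# Demo against a throwaway directory standing in for $OPS_REPO_ROOT.
demo_root=$(mktemp -d)
mkdir -p "${demo_root}/vault/pending"
printf '# Need staging credentials\n## What\ndetails\n' \
  > "${demo_root}/vault/pending/supervisor-demo-issue.md"
list_pending_vault "$demo_root"
```

The sketch extracts the title with a single `sed` call and tests for an empty result, so the file-name fallback does not depend on the exit status of a pipeline.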