feat: progressive disclosure + escalate everything to claude

- PROMPT.md references best-practices/ files instead of inlining all knowledge
- best-practices/{memory,disk,ci,dev-agent,git}.md — loaded on demand by claude
- All alerts go to claude -p. Claude decides what to fix and what to escalate.
- update-prompt.sh targets specific best-practices files for self-learning
openhands 2026-03-12 13:04:50 +00:00
parent cb7dd398c7
commit 5eb17020d5
8 changed files with 222 additions and 117 deletions


@@ -1,88 +1,50 @@
# Factory Supervisor — System Prompt
# Factory Supervisor
You are the factory supervisor for the `johba/harb` DeFi protocol repo. You were
called because `factory-poll.sh` detected an issue it couldn't auto-fix.
## Your Environment
- **VPS:** 8GB RAM, 4GB swap, Debian
- **Repo:** `/home/debian/harb` (Codeberg: johba/harb, branch: master, protected)
- **CI:** Woodpecker at localhost:8000 (Docker backend)
- **Stack:** Docker containers (anvil, ponder, webapp, landing, caddy, postgres, txn-bot, otterscan)
- **Tools:** Foundry at `~/.foundry/bin/`, Node at `~/.nvm/versions/node/v22.20.0/bin/`
- **Factory scripts:** See FACTORY_ROOT env var
You are the factory supervisor for `johba/harb`. You were called because
`factory-poll.sh` detected an issue it couldn't auto-fix.
## Priority Order
1. **P0 — Memory crisis:** RAM <500MB available OR swap >3GB. Fix IMMEDIATELY.
2. **P1 — Disk pressure:** Disk >80%. Clean up before builds fail.
3. **P2 — Factory stopped:** Dev-agent dead, CI down, git repo broken.
4. **P3 — Factory degraded:** Derailed PR, stuck pipeline, unreviewed PRs.
5. **P4 — Housekeeping:** Stale processes, log rotation, docker cleanup.
1. **P0 — Memory crisis:** RAM <500MB or swap >3GB
2. **P1 — Disk pressure:** Disk >80%
3. **P2 — Factory stopped:** Dev-agent dead, CI down, git broken
4. **P3 — Factory degraded:** Derailed PR, stuck pipeline, unreviewed PRs
5. **P4 — Housekeeping:** Stale processes, log rotation
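As a rough sketch, the P0/P1 thresholds above could be encoded like this (a hypothetical helper; `factory-poll.sh` may compute them differently):

```shell
# Sketch: map raw metrics to the P0/P1 thresholds listed above.
# Arguments: available RAM (MB), swap used (MB), disk used (%).
classify_alert() {
  local avail_mb=$1 swap_mb=$2 disk_pct=$3
  if [ "$avail_mb" -lt 500 ] || [ "$swap_mb" -gt 3072 ]; then
    echo "P0"                        # memory crisis
  elif [ "$disk_pct" -gt 80 ]; then
    echo "P1"                        # disk pressure
  else
    echo "OK"
  fi
}

classify_alert 400 1024 50    # low RAM  -> prints P0
classify_alert 4096 1024 85   # full disk -> prints P1
```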
## What You Can Do (no permission needed)
## What You Can Do
- Kill stale `claude` processes: `pgrep -f "claude" | xargs -r kill` (`-r` skips `kill` when nothing matches)
- Clean docker: `sudo docker system prune -f` (NOT `-a --volumes` — that kills CI images)
- Truncate large factory logs: `truncate -s 0 <file>`
- Remove stale lock files (`/tmp/dev-agent.lock` if PID is dead)
- Restart dev-agent on a derailed PR: `bash ${FACTORY_ROOT}/dev/dev-agent.sh <issue-number> &`
- Restart frozen Anvil: `sudo docker restart harb-anvil-1`
- Retrigger CI: empty commit + push on a PR branch
- Clean Woodpecker log_entries: `wpdb -c "DELETE FROM log_entries WHERE id < (SELECT max(id)-100000 FROM log_entries);"`
- Drop filesystem caches: `sync && echo 3 | sudo tee /proc/sys/vm/drop_caches`
- Prune git worktrees: `cd /home/debian/harb && git worktree prune`
- Kill orphan worktree processes
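The stale-lock rule above can be sketched as a small guard (a hypothetical helper, not the factory's actual code):

```shell
# Sketch: remove a lock file only when the PID it records is dead.
clean_stale_lock() {
  local lock=$1 pid
  [ -f "$lock" ] || return 0
  pid=$(cat "$lock")
  if kill -0 "$pid" 2>/dev/null; then
    echo "lock held by live pid $pid"        # leave it alone
  else
    rm -f "$lock"
    echo "removed stale lock (pid $pid is dead)"
  fi
}

clean_stale_lock /tmp/dev-agent.lock
```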
Fix the issue yourself. You have full shell access and `--dangerously-skip-permissions`.
## What You CANNOT Do (escalate to Clawy)
Before acting, read the relevant best-practices file:
- Memory issues → `cat ${FACTORY_ROOT}/factory/best-practices/memory.md`
- Disk issues → `cat ${FACTORY_ROOT}/factory/best-practices/disk.md`
- CI issues → `cat ${FACTORY_ROOT}/factory/best-practices/ci.md`
- Dev-agent issues → `cat ${FACTORY_ROOT}/factory/best-practices/dev-agent.md`
- Git issues → `cat ${FACTORY_ROOT}/factory/best-practices/git.md`
- Merge PRs
- Close/reopen issues
- Make architecture decisions
- Modify production contracts
- Run `docker system prune -a --volumes` (kills CI images, hours to rebuild)
- Anything you're unsure about
## Escalation
## Best Practices (distilled from experience)
If you can't fix it, escalate to Clawy (the main agent):
```bash
openclaw system event --text "🏭 ESCALATE: <what's wrong and why you can't fix it>" --mode now
```
### Memory Management
- Docker containers grow: Anvil reaches 12GB+ within hours. Restart is the fix.
- `claude` processes from dev-agent can zombie at 200MB+ each. Kill any older than 3h.
- `forge build` with via_ir OOMs on 8GB. Never compile full test suite — use `--skip test script`.
- After killing processes, run `sync && echo 3 | sudo tee /proc/sys/vm/drop_caches`.
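The memory thresholds in this list can be checked against `free -m`; a sketch of the parsing (field positions assume procps-ng's `free` layout):

```shell
# Sketch: extract available RAM and used swap (both MB) from `free -m` output.
parse_free() {
  awk '/^Mem:/ {avail=$7} /^Swap:/ {swap=$3} END {print avail, swap}'
}

# Guarded so the sketch is a no-op on systems without procps.
if command -v free >/dev/null; then
  read -r avail_mb swap_mb < <(free -m | parse_free)
  echo "avail=${avail_mb}MB swap_used=${swap_mb}MB"
fi
```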
Do NOT escalate if you can fix it. Do NOT ask permission. Fix first, report after.
### Disk Management
- Woodpecker `log_entries` table grows to 5GB+. Truncate periodically, then `VACUUM FULL`.
- Docker overlay layers survive normal prune. Use `docker system prune -f` (NOT `-a`).
- Git worktrees in `/tmp/harb-worktree-*` accumulate. Prune if dev-agent is idle.
- Node module caches in worktrees eat disk. Remove `/tmp/harb-worktree-*/node_modules/`.
## Output
### CI
- Codeberg rate-limits SSH clones. If `git` step fails with exit 128, retrigger (empty commit).
- CI images are pre-built. `docker system prune -a` deletes them — hours to rebuild.
- Running CI + harb stack = 14+ containers. Only run one pipeline at a time.
- `log_entries` table: truncate when >1GB.
### Dev-Agent
- Lock file at `/tmp/dev-agent.lock`. If PID is dead, remove lock file.
- Worktrees at `/tmp/harb-worktree-<issue>`. Preserved for session continuity.
- `claude` subprocess timeout is 2h. Kill if running longer.
- After killing dev-agent, ensure the issue is unclaimed (remove `in-progress` label).
### Git
- Main repo must be on `master`. If detached HEAD or mid-rebase: `git rebase --abort && git checkout master`.
- Never delete remote branches before confirmed merged.
- Stale worktrees break `git worktree add`. Run `git worktree prune` to fix.
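The recovery steps above combine into a small routine (a sketch; the dev-agent's actual auto-heal may differ):

```shell
# Sketch: return the main repo to a clean master checkout.
heal_repo() {
  local repo=$1
  git -C "$repo" rebase --abort 2>/dev/null || true   # no-op when no rebase is in progress
  git -C "$repo" checkout master
  git -C "$repo" worktree prune                       # clear stale worktree metadata
}
```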
## Output Format
After fixing, output a SHORT summary:
```
FIXED: <what you did>
REMAINING: <what still needs attention, if any>
```
or
```
ESCALATE: <what's wrong>
```
If you can't fix it:
```
ESCALATE: <what's wrong and why you can't fix it>
## Learning
If you discover something new, append it to the relevant best-practices file:
```bash
bash ${FACTORY_ROOT}/factory/update-prompt.sh "best-practices/<file>.md" "### Lesson title
Description of what you learned."
```


@@ -0,0 +1,32 @@
# CI Best Practices
## Environment
- Woodpecker CI at localhost:8000 (Docker backend)
- Postgres DB: use `wpdb` helper from env.sh
- Woodpecker API: use `woodpecker_api` helper from env.sh
- CI images: pre-built at `registry.niovi.voyage/harb/*:latest`
## Safe Fixes
- Retrigger CI: push empty commit to PR branch
```bash
cd /tmp/harb-worktree-<issue> && git commit --allow-empty -m "ci: retrigger" --no-verify && git push origin <branch> --force
```
- Restart woodpecker-agent: `sudo systemctl restart woodpecker-agent`
- View pipeline status: `wpdb -c "SELECT number, status FROM pipelines WHERE repo_id=2 ORDER BY number DESC LIMIT 5;"`
- View failed steps: `bash ${FACTORY_ROOT}/dev/ci-debug.sh failures <pipeline-number>`
- View step logs: `bash ${FACTORY_ROOT}/dev/ci-debug.sh logs <pipeline-number> <step-name>`
## Dangerous (escalate)
- Restarting woodpecker-server (drops all running pipelines)
- Modifying pipeline configs in `.woodpecker/` directory
## Known Issues
- Codeberg rate-limits SSH clones. `git` step fails with exit 128. Retrigger usually works.
- `log_entries` table grows fast (was 5.6GB once). Truncate periodically.
- Running CI + harb stack = 14+ containers on 8GB. Memory pressure is real.
- CI images take hours to rebuild. Never run `docker system prune -a`.
## Lessons Learned
- Exit code 128 on git step = Codeberg rate limit, not a code problem. Retrigger.
- Exit code 137 = OOM kill. Check memory, kill stale processes, retrigger.
- `node-quality` step fails on eslint/typescript errors — these need code fixes, not CI fixes.
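The exit-code lessons reduce to a lookup (a sketch; the actions paraphrase the notes above):

```shell
# Sketch: suggested action per CI step exit code.
triage_exit() {
  case $1 in
    128) echo "retrigger: Codeberg rate limit, not a code problem" ;;
    137) echo "OOM kill: check memory, kill stale processes, retrigger" ;;
    *)   echo "inspect step logs" ;;
  esac
}

triage_exit 137
```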


@@ -0,0 +1,37 @@
# Dev-Agent Best Practices
## Architecture
- `dev-poll.sh` (cron */10) → finds ready backlog issues → spawns `dev-agent.sh`
- `dev-agent.sh` uses `claude -p` for implementation, runs in git worktree
- Lock file: `/tmp/dev-agent.lock` (contains PID)
- Status file: `/tmp/dev-agent-status`
- Worktrees: `/tmp/harb-worktree-<issue-number>/`
## Safe Fixes
- Remove stale lock: `rm -f /tmp/dev-agent.lock` (only if PID is dead)
- Kill stuck agent: `kill <pid>` then clean lock
- Restart on derailed PR: `bash ${FACTORY_ROOT}/dev/dev-agent.sh <issue-number> &`
- Clean worktree: `cd /home/debian/harb && git worktree remove /tmp/harb-worktree-<N> --force`
- Remove `in-progress` label if agent died without cleanup:
```bash
codeberg_api DELETE "/issues/<N>/labels/in-progress"
```
## Dangerous (escalate)
- Restarting agent on an issue that has an open PR with review changes — may lose context
- Anything that modifies the PR branch history
- Closing PRs or issues
## Known Issues
- `claude -p -c` (continue) fails if session was compacted — falls back to fresh `-p`
- CI_FIX_COUNT is now reset on CI pass (fixed 2026-03-12), so each review phase gets fresh CI fix budget
- Worktree creation fails if main repo has stale rebase — auto-heals now
- Large text in jq `--arg` can break — write to file first
- `$([ "$VAR" = true ] && echo "...")` crashes under `set -euo pipefail`
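The last pitfall deserves a concrete illustration. Under `set -e`, a false test inside `$([ … ] && echo …)` gives the assignment a non-zero exit status and aborts the script; an explicit `if` is safe (a minimal sketch):

```shell
set -euo pipefail

VAR=false
# FLAG=$([ "$VAR" = true ] && echo on)   # would abort here: substitution exits 1
FLAG=""
if [ "$VAR" = true ]; then FLAG="on"; fi
echo "flag='${FLAG}'"
```

For the jq pitfall in the same list, `--rawfile var file` (jq 1.6+) reads the payload from disk instead of passing it through the argv-limited `--arg`.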
## Lessons Learned
- Agents don't have memory between tasks — full context must be in the prompt
- Prior art injection (closed PR diffs) prevents rework
- Feature issues MUST list affected e2e test files
- CI fix loop is essential — first attempt rarely works
- CLAUDE_TIMEOUT=7200 (2h) is needed for complex issues
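The 2h limit maps onto coreutils `timeout`, whose exit status 124 distinguishes a hang from a normal failure (a sketch with `sleep` standing in for the `claude` call):

```shell
CLAUDE_TIMEOUT=${CLAUDE_TIMEOUT:-7200}   # seconds; 2h for complex issues

# Stand-in for: timeout "$CLAUDE_TIMEOUT" claude -p "$PROMPT"
timeout 1 sleep 5 && rc=0 || rc=$?
echo "rc=$rc"    # 124 means the time limit was hit
```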


@@ -0,0 +1,24 @@
# Disk Best Practices
## Safe Fixes
- Docker cleanup: `sudo docker system prune -f` (keeps images, removes stopped containers + dangling layers)
- Truncate factory logs >5MB: `truncate -s 0 <file>`
- Remove stale worktrees: check `/tmp/harb-worktree-*`, only if dev-agent not running on them
- Woodpecker log_entries: `DELETE FROM log_entries WHERE id < (SELECT max(id) - 100000 FROM log_entries);` then `VACUUM FULL;` (plain `VACUUM` frees space for reuse but does not shrink the files)
- Node module caches in worktrees: `rm -rf /tmp/harb-worktree-*/node_modules/`
- Git garbage collection: `cd /home/debian/harb && git gc --prune=now`
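The worktree cleanup item above can be sketched as follows (a hypothetical helper; confirm the dev-agent is idle before running anything like it):

```shell
# Sketch: delete node_modules inside /tmp/harb-worktree-* style directories.
clean_worktree_modules() {
  local base=$1
  # -prune stops find from descending into each node_modules before removing it
  find "$base" -maxdepth 2 -type d -name node_modules -prune -exec rm -rf {} +
}
```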
## Dangerous (escalate)
- `docker system prune -a --volumes` — deletes ALL images including CI build cache
- Deleting anything in `/home/debian/harb/` that's tracked by git
- Truncating Woodpecker DB tables other than log_entries
## Known Disk Hogs
- Woodpecker `log_entries` table: grows to 5GB+. Truncate periodically.
- Docker overlay layers: survive normal prune. `-a` variant kills everything.
- Git worktrees in /tmp: accumulate node_modules, build artifacts
- Forge cache in `~/.foundry/cache/`: can grow large with many compilations
## Lessons Learned
- After truncating log_entries, run VACUUM FULL (reclaims actual disk space)
- Docker ghost overlay layers need `prune -a` but that kills CI images — only do this if truly desperate


@@ -0,0 +1,30 @@
# Git Best Practices
## Environment
- Repo: `/home/debian/harb`, remote: `codeberg.org/johba/harb`
- Branch: `master` (protected — no direct push, PRs only)
- Worktrees: `/tmp/harb-worktree-<issue>/`
## Safe Fixes
- Abort stale rebase: `cd /home/debian/harb && git rebase --abort`
- Switch to master: `git checkout master`
- Prune worktrees: `git worktree prune`
- Reset dirty state: `git checkout -- .` (only uncommitted changes)
- Fetch latest: `git fetch origin master`
## Dangerous (escalate)
- `git reset --hard` on any branch with unpushed work
- Deleting remote branches
- Force-pushing to any branch
- Anything on the master branch directly
## Known Issues
- Main repo MUST be on master at all times. Dev work happens in worktrees.
- Stale rebases (detached HEAD) break all worktree creation — silent factory stall.
- `git worktree add` fails if target directory exists (even empty). Remove first.
- Many old branches exist locally (100+). Normal — don't bulk-delete.
## Lessons Learned
- NEVER delete remote branches before confirming merge. Close PR, rebase locally, force-push if needed.
- Stale rebase caused 5h factory stall once (2026-03-11). Auto-heal added to dev-agent.
- lint-staged hooks fail when `forge` not in PATH. Use `--no-verify` when committing from scripts.


@@ -0,0 +1,29 @@
# Memory Best Practices
## Environment
- VPS: 8GB RAM, 4GB swap, Debian
- Running: Docker stack (8 containers), Woodpecker CI, OpenClaw gateway
## Safe Fixes (no permission needed)
- Kill stale `claude` processes (>3h old): `pgrep -f "claude" --older 10800 | xargs -r kill` (`-r` skips `kill` when nothing matches)
- Drop filesystem caches: `sync && echo 3 | sudo tee /proc/sys/vm/drop_caches`
- Restart bloated Anvil: `sudo docker restart harb-anvil-1` (grows to 12GB+ over hours)
- Kill orphan node processes from dead worktrees
## Dangerous (escalate)
- `docker system prune -a --volumes` — kills CI images, hours to rebuild
- Stopping harb stack containers — breaks dev environment
- OOM that survives all safe fixes — needs human decision on what to kill
## Known Memory Hogs
- `claude` processes from dev-agent: 200MB+ each, can zombie
- `dockerd`: 600MB+ baseline (normal)
- `openclaw-gateway`: 500MB+ (normal)
- Anvil container: starts small, grows unbounded over hours
- `forge build` with via_ir: can spike to 4GB+. Use `--skip test script` to reduce.
- Vite dev servers inside containers: 150MB+ each
## Lessons Learned
- After killing processes, always `sync && echo 3 | sudo tee /proc/sys/vm/drop_caches`
- Swap doesn't drain from dropping caches alone — it's actual paged-out process memory
- Running CI + full harb stack = 14+ containers on 8GB. Only one pipeline at a time.


@@ -267,19 +267,15 @@ ALL_ALERTS="${P0_ALERTS}${P1_ALERTS}${P2_ALERTS}${P3_ALERTS}${P4_ALERTS}"
if [ -n "$ALL_ALERTS" ]; then
ALERT_TEXT=$(echo -e "$ALL_ALERTS")
FIX_TEXT=""
[ -n "$FIXES" ] && FIX_TEXT="\n\nAuto-fixed:\n$(echo -e "$FIXES")"
# If P0 or P1 alerts remain after auto-fix, invoke claude -p
if [ -n "$P0_ALERTS" ] || [ -n "$P1_ALERTS" ]; then
flog "Invoking claude -p for critical alert"
flog "Invoking claude -p for alerts"
CLAUDE_PROMPT="$(cat "$PROMPT_FILE" 2>/dev/null || echo "You are a factory supervisor. Fix the issue below.")
## Current Alert
## Current Alerts
${ALERT_TEXT}
## Auto-fixes already applied
## Auto-fixes already applied by bash
$(echo -e "${FIXES:-None}")
## System State
@@ -288,30 +284,12 @@ Disk: $(df -h / | awk 'NR==2{printf "%s used of %s (%s)", $3, $2, $5}')
Docker: $(sudo docker ps --format '{{.Names}}' 2>/dev/null | wc -l) containers running
Claude procs: $(pgrep -f "claude" 2>/dev/null | wc -l)
## Instructions
Fix whatever you can. Follow priority order. Output FIXED/ESCALATE summary."
Fix what you can. Escalate what you can't. Read the relevant best-practices file first."
CLAUDE_OUTPUT=$(timeout 300 claude -p --model sonnet --dangerously-skip-permissions \
"$CLAUDE_PROMPT" 2>&1) || true
flog "claude output: ${CLAUDE_OUTPUT}"
# If claude fixed things, don't escalate
if echo "$CLAUDE_OUTPUT" | grep -q "^FIXED:"; then
flog "claude fixed the issue"
ALL_ALERTS="" # Clear — handled
fi
fi
# Escalate remaining alerts
if [ -n "$ALL_ALERTS" ]; then
openclaw system event \
--text "🏭 Factory Alert:
${ALERT_TEXT}${FIX_TEXT}" \
--mode now 2>/dev/null || true
status "alerts escalated"
else
status "all issues auto-resolved"
fi
flog "claude output: $(echo "$CLAUDE_OUTPUT" | tail -20)"
status "claude responded"
else
[ -n "$FIXES" ] && flog "Housekeeping: $(echo -e "$FIXES")"
status "all clear"


@@ -1,34 +1,47 @@
#!/usr/bin/env bash
# update-prompt.sh — Append a lesson learned to PROMPT.md
# update-prompt.sh — Append a lesson to a best-practices file
#
# Usage:
# ./factory/update-prompt.sh "### Title" "Body text describing the lesson"
# ./factory/update-prompt.sh --from-file /tmp/lesson.md
# ./factory/update-prompt.sh "best-practices/memory.md" "### Title\nBody text"
# ./factory/update-prompt.sh "best-practices/memory.md" --from-file /tmp/lesson.md
#
# Called by claude -p when it learns something new during a fix.
# Called by claude -p when it learns something during a fix.
# Commits and pushes the update to the dark-factory repo.
source "$(dirname "$0")/../lib/env.sh"
PROMPT_FILE="${FACTORY_ROOT}/factory/PROMPT.md"
TARGET_FILE="${FACTORY_ROOT}/factory/$1"
shift
if [ "$1" = "--from-file" ] && [ -f "$2" ]; then
LESSON=$(cat "$2")
elif [ -n "$1" ] && [ -n "$2" ]; then
LESSON="$1
$2"
elif [ -n "$1" ]; then
LESSON="$1"
else
echo "Usage: update-prompt.sh 'Title' 'Body' OR update-prompt.sh --from-file path" >&2
echo "Usage: update-prompt.sh <relative-path> '<lesson text>'" >&2
echo " or: update-prompt.sh <relative-path> --from-file <path>" >&2
exit 1
fi
# Append to PROMPT.md under Best Practices
echo "" >> "$PROMPT_FILE"
echo "$LESSON" >> "$PROMPT_FILE"
if [ ! -f "$TARGET_FILE" ]; then
echo "Target file not found: $TARGET_FILE" >&2
exit 1
fi
# Append under "Lessons Learned" section if it exists, otherwise at end
if grep -q "## Lessons Learned" "$TARGET_FILE"; then
echo "" >> "$TARGET_FILE"
echo "$LESSON" >> "$TARGET_FILE"
else
echo "" >> "$TARGET_FILE"
echo "## Lessons Learned" >> "$TARGET_FILE"
echo "" >> "$TARGET_FILE"
echo "$LESSON" >> "$TARGET_FILE"
fi
cd "$FACTORY_ROOT"
git add factory/PROMPT.md
git commit -m "factory: update supervisor best practices" --no-verify 2>/dev/null
git add "$TARGET_FILE"
git commit -m "factory: learned — $(echo "$LESSON" | head -1 | sed 's/^#* *//')" --no-verify 2>/dev/null
git push origin main 2>/dev/null
log "PROMPT.md updated with new lesson"
log "Updated $(basename "$TARGET_FILE") with new lesson"