diff --git a/factory/PROMPT.md b/factory/PROMPT.md
index 389adac..2de4dbd 100644
--- a/factory/PROMPT.md
+++ b/factory/PROMPT.md
@@ -1,88 +1,50 @@
-# Factory Supervisor — System Prompt
+# Factory Supervisor
 
-You are the factory supervisor for the `johba/harb` DeFi protocol repo. You were
-called because `factory-poll.sh` detected an issue it couldn't auto-fix.
-
-## Your Environment
-
-- **VPS:** 8GB RAM, 4GB swap, Debian
-- **Repo:** `/home/debian/harb` (Codeberg: johba/harb, branch: master, protected)
-- **CI:** Woodpecker at localhost:8000 (Docker backend)
-- **Stack:** Docker containers (anvil, ponder, webapp, landing, caddy, postgres, txn-bot, otterscan)
-- **Tools:** Foundry at `~/.foundry/bin/`, Node at `~/.nvm/versions/node/v22.20.0/bin/`
-- **Factory scripts:** See FACTORY_ROOT env var
+You are the factory supervisor for `johba/harb`. You were called because
+`factory-poll.sh` detected an issue it couldn't auto-fix.
 
 ## Priority Order
 
-1. **P0 — Memory crisis:** RAM <500MB available OR swap >3GB. Fix IMMEDIATELY.
-2. **P1 — Disk pressure:** Disk >80%. Clean up before builds fail.
-3. **P2 — Factory stopped:** Dev-agent dead, CI down, git repo broken.
-4. **P3 — Factory degraded:** Derailed PR, stuck pipeline, unreviewed PRs.
-5. **P4 — Housekeeping:** Stale processes, log rotation, docker cleanup.
+1. **P0 — Memory crisis:** RAM <500MB or swap >3GB
+2. **P1 — Disk pressure:** Disk >80%
+3. **P2 — Factory stopped:** Dev-agent dead, CI down, git broken
+4. **P3 — Factory degraded:** Derailed PR, stuck pipeline, unreviewed PRs
+5. **P4 — Housekeeping:** Stale processes, log rotation
 
-## What You Can Do (no permission needed)
+## What You Can Do
 
-- Kill stale `claude` processes (`pgrep -f "claude" | xargs kill`)
-- Clean docker: `sudo docker system prune -f` (NOT `-a --volumes` — that kills CI images)
-- Truncate large logs: `truncate -s 0 ` for factory logs
-- Remove stale lock files (`/tmp/dev-agent.lock` if PID is dead)
-- Restart dev-agent on a derailed PR: `bash ${FACTORY_ROOT}/dev/dev-agent.sh &`
-- Restart frozen Anvil: `sudo docker restart harb-anvil-1`
-- Retrigger CI: empty commit + push on a PR branch
-- Clean Woodpecker log_entries: `wpdb -c "DELETE FROM log_entries WHERE id < (SELECT max(id)-100000 FROM log_entries);"`
-- Drop filesystem caches: `sync && echo 3 | sudo tee /proc/sys/vm/drop_caches`
-- Prune git worktrees: `cd /home/debian/harb && git worktree prune`
-- Kill orphan worktree processes
+Fix the issue yourself. You have full shell access and `--dangerously-skip-permissions`.
 
-## What You CANNOT Do (escalate to Clawy)
+Before acting, read the relevant best-practices file:
+- Memory issues → `cat ${FACTORY_ROOT}/factory/best-practices/memory.md`
+- Disk issues → `cat ${FACTORY_ROOT}/factory/best-practices/disk.md`
+- CI issues → `cat ${FACTORY_ROOT}/factory/best-practices/ci.md`
+- Dev-agent issues → `cat ${FACTORY_ROOT}/factory/best-practices/dev-agent.md`
+- Git issues → `cat ${FACTORY_ROOT}/factory/best-practices/git.md`
 
-- Merge PRs
-- Close/reopen issues
-- Make architecture decisions
-- Modify production contracts
-- Run `docker system prune -a --volumes` (kills CI images, hours to rebuild)
-- Anything you're unsure about
+## Escalation
 
-## Best Practices (distilled from experience)
+If you can't fix it, escalate to Clawy (the main agent):
+```bash
+openclaw system event --text "🏭 ESCALATE: " --mode now
+```
 
-### Memory Management
-- Docker containers grow: Anvil reaches 12GB+ within hours. Restart is the fix.
-- `claude` processes from dev-agent can zombie at 200MB+ each. Kill any older than 3h.
-- `forge build` with via_ir OOMs on 8GB. Never compile full test suite — use `--skip test script`.
-- After killing processes, run `sync && echo 3 | sudo tee /proc/sys/vm/drop_caches`.
+Do NOT escalate if you can fix it. Do NOT ask permission. Fix first, report after.
 
-### Disk Management
-- Woodpecker `log_entries` table grows to 5GB+. Truncate periodically, then `VACUUM FULL`.
-- Docker overlay layers survive normal prune. Use `docker system prune -f` (NOT `-a`).
-- Git worktrees in `/tmp/harb-worktree-*` accumulate. Prune if dev-agent is idle.
-- Node module caches in worktrees eat disk. Remove `/tmp/harb-worktree-*/node_modules/`.
+## Output
 
-### CI
-- Codeberg rate-limits SSH clones. If `git` step fails with exit 128, retrigger (empty commit).
-- CI images are pre-built. `docker system prune -a` deletes them — hours to rebuild.
-- Running CI + harb stack = 14+ containers. Only run one pipeline at a time.
-- `log_entries` table: truncate when >1GB.
-
-### Dev-Agent
-- Lock file at `/tmp/dev-agent.lock`. If PID is dead, remove lock file.
-- Worktrees at `/tmp/harb-worktree-`. Preserved for session continuity.
-- `claude` subprocess timeout is 2h. Kill if running longer.
-- After killing dev-agent, ensure the issue is unclaimed (remove `in-progress` label).
-
-### Git
-- Main repo must be on `master`. If detached HEAD or mid-rebase: `git rebase --abort && git checkout master`.
-- Never delete remote branches before confirmed merged.
-- Stale worktrees break `git worktree add`. Run `git worktree prune` to fix.
-
-## Output Format
-
-After fixing, output a SHORT summary:
 ```
 FIXED: 
-REMAINING: 
+```
+or
+```
+ESCALATE: 
 ```
 
-If you can't fix it:
-```
-ESCALATE: 
+## Learning
+
+If you discover something new, append it to the relevant best-practices file:
+```bash
+bash ${FACTORY_ROOT}/factory/update-prompt.sh "best-practices/.md" "### Lesson title
+Description of what you learned."
 ```
diff --git a/factory/best-practices/ci.md b/factory/best-practices/ci.md
new file mode 100644
index 0000000..bfcaa1c
--- /dev/null
+++ b/factory/best-practices/ci.md
@@ -0,0 +1,32 @@
+# CI Best Practices
+
+## Environment
+- Woodpecker CI at localhost:8000 (Docker backend)
+- Postgres DB: use `wpdb` helper from env.sh
+- Woodpecker API: use `woodpecker_api` helper from env.sh
+- CI images: pre-built at `registry.niovi.voyage/harb/*:latest`
+
+## Safe Fixes
+- Retrigger CI: push empty commit to PR branch
+  ```bash
+  cd /tmp/harb-worktree- && git commit --allow-empty -m "ci: retrigger" --no-verify && git push origin --force
+  ```
+- Restart woodpecker-agent: `sudo systemctl restart woodpecker-agent`
+- View pipeline status: `wpdb -c "SELECT number, status FROM pipelines WHERE repo_id=2 ORDER BY number DESC LIMIT 5;"`
+- View failed steps: `bash ${FACTORY_ROOT}/dev/ci-debug.sh failures `
+- View step logs: `bash ${FACTORY_ROOT}/dev/ci-debug.sh logs `
+
+## Dangerous (escalate)
+- Restarting woodpecker-server (drops all running pipelines)
+- Modifying pipeline configs in `.woodpecker/` directory
+
+## Known Issues
+- Codeberg rate-limits SSH clones. `git` step fails with exit 128. Retrigger usually works.
+- `log_entries` table grows fast (was 5.6GB once). Truncate periodically.
+- Running CI + harb stack = 14+ containers on 8GB. Memory pressure is real.
+- CI images take hours to rebuild. Never run `docker system prune -a`.
+
+## Lessons Learned
+- Exit code 128 on git step = Codeberg rate limit, not a code problem. Retrigger.
+- Exit code 137 = OOM kill. Check memory, kill stale processes, retrigger.
+- `node-quality` step fails on eslint/typescript errors — these need code fixes, not CI fixes.
diff --git a/factory/best-practices/dev-agent.md b/factory/best-practices/dev-agent.md
new file mode 100644
index 0000000..1f8b6f6
--- /dev/null
+++ b/factory/best-practices/dev-agent.md
@@ -0,0 +1,37 @@
+# Dev-Agent Best Practices
+
+## Architecture
+- `dev-poll.sh` (cron */10) → finds ready backlog issues → spawns `dev-agent.sh`
+- `dev-agent.sh` uses `claude -p` for implementation, runs in git worktree
+- Lock file: `/tmp/dev-agent.lock` (contains PID)
+- Status file: `/tmp/dev-agent-status`
+- Worktrees: `/tmp/harb-worktree-/`
+
+## Safe Fixes
+- Remove stale lock: `rm -f /tmp/dev-agent.lock` (only if PID is dead)
+- Kill stuck agent: `kill ` then clean lock
+- Restart on derailed PR: `bash ${FACTORY_ROOT}/dev/dev-agent.sh &`
+- Clean worktree: `cd /home/debian/harb && git worktree remove /tmp/harb-worktree- --force`
+- Remove `in-progress` label if agent died without cleanup:
+  ```bash
+  codeberg_api DELETE "/issues//labels/in-progress"
+  ```
+
+## Dangerous (escalate)
+- Restarting agent on an issue that has an open PR with review changes — may lose context
+- Anything that modifies the PR branch history
+- Closing PRs or issues
+
+## Known Issues
+- `claude -p -c` (continue) fails if session was compacted — falls back to fresh `-p`
+- CI_FIX_COUNT is now reset on CI pass (fixed 2026-03-12), so each review phase gets fresh CI fix budget
+- Worktree creation fails if main repo has stale rebase — auto-heals now
+- Large text in jq `--arg` can break — write to file first
+- `$([ "$VAR" = true ] && echo "...")` crashes under `set -euo pipefail`
+
+## Lessons Learned
+- Agents don't have memory between tasks — full context must be in the prompt
+- Prior art injection (closed PR diffs) prevents rework
+- Feature issues MUST list affected e2e test files
+- CI fix loop is essential — first attempt rarely works
+- CLAUDE_TIMEOUT=7200 (2h) is needed for complex issues
diff --git a/factory/best-practices/disk.md b/factory/best-practices/disk.md
new file mode 100644
index 0000000..4d2a411
--- /dev/null
+++ b/factory/best-practices/disk.md
@@ -0,0 +1,24 @@
+# Disk Best Practices
+
+## Safe Fixes
+- Docker cleanup: `sudo docker system prune -f` (keeps images, removes stopped containers + dangling layers)
+- Truncate factory logs >5MB: `truncate -s 0 `
+- Remove stale worktrees: check `/tmp/harb-worktree-*`, only if dev-agent not running on them
+- Woodpecker log_entries: `DELETE FROM log_entries WHERE id < (SELECT max(id) - 100000 FROM log_entries);` then `VACUUM;`
+- Node module caches in worktrees: `rm -rf /tmp/harb-worktree-*/node_modules/`
+- Git garbage collection: `cd /home/debian/harb && git gc --prune=now`
+
+## Dangerous (escalate)
+- `docker system prune -a --volumes` — deletes ALL images including CI build cache
+- Deleting anything in `/home/debian/harb/` that's tracked by git
+- Truncating Woodpecker DB tables other than log_entries
+
+## Known Disk Hogs
+- Woodpecker `log_entries` table: grows to 5GB+. Truncate periodically.
+- Docker overlay layers: survive normal prune. `-a` variant kills everything.
+- Git worktrees in /tmp: accumulate node_modules, build artifacts
+- Forge cache in `~/.foundry/cache/`: can grow large with many compilations
+
+## Lessons Learned
+- After truncating log_entries, run VACUUM FULL (reclaims actual disk space)
+- Docker ghost overlay layers need `prune -a` but that kills CI images — only do this if truly desperate
diff --git a/factory/best-practices/git.md b/factory/best-practices/git.md
new file mode 100644
index 0000000..45d4e68
--- /dev/null
+++ b/factory/best-practices/git.md
@@ -0,0 +1,30 @@
+# Git Best Practices
+
+## Environment
+- Repo: `/home/debian/harb`, remote: `codeberg.org/johba/harb`
+- Branch: `master` (protected — no direct push, PRs only)
+- Worktrees: `/tmp/harb-worktree-/`
+
+## Safe Fixes
+- Abort stale rebase: `cd /home/debian/harb && git rebase --abort`
+- Switch to master: `git checkout master`
+- Prune worktrees: `git worktree prune`
+- Reset dirty state: `git checkout -- .` (only uncommitted changes)
+- Fetch latest: `git fetch origin master`
+
+## Dangerous (escalate)
+- `git reset --hard` on any branch with unpushed work
+- Deleting remote branches
+- Force-pushing to any branch
+- Anything on the master branch directly
+
+## Known Issues
+- Main repo MUST be on master at all times. Dev work happens in worktrees.
+- Stale rebases (detached HEAD) break all worktree creation — silent factory stall.
+- `git worktree add` fails if target directory exists (even empty). Remove first.
+- Many old branches exist locally (100+). Normal — don't bulk-delete.
+
+## Lessons Learned
+- NEVER delete remote branches before confirming merge. Close PR, rebase locally, force-push if needed.
+- Stale rebase caused 5h factory stall once (2026-03-11). Auto-heal added to dev-agent.
+- lint-staged hooks fail when `forge` not in PATH. Use `--no-verify` when committing from scripts.
diff --git a/factory/best-practices/memory.md b/factory/best-practices/memory.md
new file mode 100644
index 0000000..1954b0f
--- /dev/null
+++ b/factory/best-practices/memory.md
@@ -0,0 +1,29 @@
+# Memory Best Practices
+
+## Environment
+- VPS: 8GB RAM, 4GB swap, Debian
+- Running: Docker stack (8 containers), Woodpecker CI, OpenClaw gateway
+
+## Safe Fixes (no permission needed)
+- Kill stale `claude` processes (>3h old): `pgrep -f "claude" --older 10800 | xargs kill`
+- Drop filesystem caches: `sync && echo 3 | sudo tee /proc/sys/vm/drop_caches`
+- Restart bloated Anvil: `sudo docker restart harb-anvil-1` (grows to 12GB+ over hours)
+- Kill orphan node processes from dead worktrees
+
+## Dangerous (escalate)
+- `docker system prune -a --volumes` — kills CI images, hours to rebuild
+- Stopping harb stack containers — breaks dev environment
+- OOM that survives all safe fixes — needs human decision on what to kill
+
+## Known Memory Hogs
+- `claude` processes from dev-agent: 200MB+ each, can zombie
+- `dockerd`: 600MB+ baseline (normal)
+- `openclaw-gateway`: 500MB+ (normal)
+- Anvil container: starts small, grows unbounded over hours
+- `forge build` with via_ir: can spike to 4GB+. Use `--skip test script` to reduce.
+- Vite dev servers inside containers: 150MB+ each
+
+## Lessons Learned
+- After killing processes, always `sync && echo 3 | sudo tee /proc/sys/vm/drop_caches`
+- Swap doesn't drain from dropping caches alone — it's actual paged-out process memory
+- Running CI + full harb stack = 14+ containers on 8GB. Only one pipeline at a time.
diff --git a/factory/factory-poll.sh b/factory/factory-poll.sh
index c74be55..fa1cd99 100755
--- a/factory/factory-poll.sh
+++ b/factory/factory-poll.sh
@@ -267,19 +267,15 @@ ALL_ALERTS="${P0_ALERTS}${P1_ALERTS}${P2_ALERTS}${P3_ALERTS}${P4_ALERTS}"
 
 if [ -n "$ALL_ALERTS" ]; then
   ALERT_TEXT=$(echo -e "$ALL_ALERTS")
-  FIX_TEXT=""
-  [ -n "$FIXES" ] && FIX_TEXT="\n\nAuto-fixed:\n$(echo -e "$FIXES")"
 
-  # If P0 or P1 alerts remain after auto-fix, invoke claude -p
-  if [ -n "$P0_ALERTS" ] || [ -n "$P1_ALERTS" ]; then
-    flog "Invoking claude -p for critical alert"
+  flog "Invoking claude -p for alerts"
 
-    CLAUDE_PROMPT="$(cat "$PROMPT_FILE" 2>/dev/null || echo "You are a factory supervisor. Fix the issue below.")
+  CLAUDE_PROMPT="$(cat "$PROMPT_FILE" 2>/dev/null || echo "You are a factory supervisor. Fix the issue below.")
 
-## Current Alert
+## Current Alerts
 ${ALERT_TEXT}
 
-## Auto-fixes already applied
+## Auto-fixes already applied by bash
 $(echo -e "${FIXES:-None}")
 
 ## System State
@@ -288,30 +284,12 @@ Disk: $(df -h / | awk 'NR==2{printf "%s used of %s (%s)", $3, $2, $5}')
 Docker: $(sudo docker ps --format '{{.Names}}' 2>/dev/null | wc -l) containers running
 Claude procs: $(pgrep -f "claude" 2>/dev/null | wc -l)
 
-## Instructions
-Fix whatever you can. Follow priority order. Output FIXED/ESCALATE summary."
+Fix what you can. Escalate what you can't. Read the relevant best-practices file first."
 
-    CLAUDE_OUTPUT=$(timeout 300 claude -p --model sonnet --dangerously-skip-permissions \
-      "$CLAUDE_PROMPT" 2>&1) || true
-    flog "claude output: ${CLAUDE_OUTPUT}"
-
-    # If claude fixed things, don't escalate
-    if echo "$CLAUDE_OUTPUT" | grep -q "^FIXED:"; then
-      flog "claude fixed the issue"
-      ALL_ALERTS="" # Clear — handled
-    fi
-  fi
-
-  # Escalate remaining alerts
-  if [ -n "$ALL_ALERTS" ]; then
-    openclaw system event \
-      --text "🏭 Factory Alert:
-${ALERT_TEXT}${FIX_TEXT}" \
-      --mode now 2>/dev/null || true
-    status "alerts escalated"
-  else
-    status "all issues auto-resolved"
-  fi
+  CLAUDE_OUTPUT=$(timeout 300 claude -p --model sonnet --dangerously-skip-permissions \
+    "$CLAUDE_PROMPT" 2>&1) || true
+  flog "claude output: $(echo "$CLAUDE_OUTPUT" | tail -20)"
+  status "claude responded"
 else
   [ -n "$FIXES" ] && flog "Housekeeping: $(echo -e "$FIXES")"
   status "all clear"
diff --git a/factory/update-prompt.sh b/factory/update-prompt.sh
index ab7eb88..0d1e740 100755
--- a/factory/update-prompt.sh
+++ b/factory/update-prompt.sh
@@ -1,34 +1,47 @@
 #!/usr/bin/env bash
-# update-prompt.sh — Append a lesson learned to PROMPT.md
+# update-prompt.sh — Append a lesson to a best-practices file
 #
 # Usage:
-#   ./factory/update-prompt.sh "### Title" "Body text describing the lesson"
-#   ./factory/update-prompt.sh --from-file /tmp/lesson.md
+#   ./factory/update-prompt.sh "best-practices/memory.md" "### Title\nBody text"
+#   ./factory/update-prompt.sh --from-file "best-practices/memory.md" /tmp/lesson.md
 #
-# Called by claude -p when it learns something new during a fix.
+# Called by claude -p when it learns something during a fix.
 # Commits and pushes the update to the dark-factory repo.
 
 source "$(dirname "$0")/../lib/env.sh"
 
-PROMPT_FILE="${FACTORY_ROOT}/factory/PROMPT.md"
+TARGET_FILE="${FACTORY_ROOT}/factory/$1"
+shift
 
 if [ "$1" = "--from-file" ] && [ -f "$2" ]; then
   LESSON=$(cat "$2")
-elif [ -n "$1" ] && [ -n "$2" ]; then
-  LESSON="$1
-$2"
+elif [ -n "$1" ]; then
+  LESSON="$1"
 else
-  echo "Usage: update-prompt.sh 'Title' 'Body' OR update-prompt.sh --from-file path" >&2
+  echo "Usage: update-prompt.sh ''" >&2
+  echo "   or: update-prompt.sh --from-file " >&2
   exit 1
 fi
 
-# Append to PROMPT.md under Best Practices
-echo "" >> "$PROMPT_FILE"
-echo "$LESSON" >> "$PROMPT_FILE"
+if [ ! -f "$TARGET_FILE" ]; then
+  echo "Target file not found: $TARGET_FILE" >&2
+  exit 1
+fi
+
+# Append under "Lessons Learned" section if it exists, otherwise at end
+if grep -q "## Lessons Learned" "$TARGET_FILE"; then
+  echo "" >> "$TARGET_FILE"
+  echo "$LESSON" >> "$TARGET_FILE"
+else
+  echo "" >> "$TARGET_FILE"
+  echo "## Lessons Learned" >> "$TARGET_FILE"
+  echo "" >> "$TARGET_FILE"
+  echo "$LESSON" >> "$TARGET_FILE"
+fi
 
 cd "$FACTORY_ROOT"
-git add factory/PROMPT.md
-git commit -m "factory: update supervisor best practices" --no-verify 2>/dev/null
+git add "factory/$1" 2>/dev/null || git add "$TARGET_FILE"
+git commit -m "factory: learned — $(echo "$LESSON" | head -1 | sed 's/^#* *//')" --no-verify 2>/dev/null
 git push origin main 2>/dev/null
 
-log "PROMPT.md updated with new lesson"
+log "Updated $(basename "$TARGET_FILE") with new lesson"
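
Review note on `factory/update-prompt.sh`: the patched script reads the target from `$1` before `shift`, so the documented `--from-file "best-practices/memory.md" /tmp/lesson.md` form would appear to resolve the target to `factory/--from-file`. After the shift, `git add "factory/$1"` sees the lesson argument, not the path, and only the `|| git add "$TARGET_FILE"` fallback stages the right file. One possible way to handle both usage forms is sketched below. The function name `parse_lesson_args` and the variable `TARGET_REL` are hypothetical and not part of the patch.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not in the patch): resolve both documented
# usage forms of update-prompt.sh to the same two values.
#   parse_lesson_args "best-practices/memory.md" "### Title"
#   parse_lesson_args --from-file "best-practices/memory.md" /tmp/lesson.md
# On success it sets TARGET_REL (path relative to factory/) and LESSON.
parse_lesson_args() {
  TARGET_REL=""
  LESSON=""
  if [ "${1:-}" = "--from-file" ]; then
    # Flag-first form: the flag, then the target, then a readable lesson file.
    [ -n "${2:-}" ] && [ -f "${3:-}" ] || return 1
    TARGET_REL="$2"
    LESSON=$(cat "$3")
  elif [ -n "${1:-}" ] && [ -n "${2:-}" ]; then
    # Inline form: the target, then the lesson text itself.
    TARGET_REL="$1"
    LESSON="$2"
  else
    # Anything else is a usage error; the caller prints the usage text.
    return 1
  fi
}
```

With the relative path kept in its own variable, the later steps could use `"${FACTORY_ROOT}/factory/${TARGET_REL}"` for the append and `git add "factory/${TARGET_REL}"` for staging, without depending on what `$1` holds after a `shift`.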