feat: progressive disclosure + escalate everything to claude
- PROMPT.md references best-practices/ files instead of inlining all knowledge
- best-practices/{memory,disk,ci,dev-agent,git}.md — loaded on demand by claude
- All alerts go to claude -p. Claude decides what to fix and what to escalate.
- update-prompt.sh targets specific best-practices files for self-learning
parent cb7dd398c7
commit 5eb17020d5
8 changed files with 222 additions and 117 deletions
32
factory/best-practices/ci.md
Normal file
@@ -0,0 +1,32 @@
# CI Best Practices

## Environment
- Woodpecker CI at localhost:8000 (Docker backend)
- Postgres DB: use `wpdb` helper from env.sh
- Woodpecker API: use `woodpecker_api` helper from env.sh
- CI images: pre-built at `registry.niovi.voyage/harb/*:latest`

## Safe Fixes
- Retrigger CI: push an empty commit to the PR branch (a fast-forward push, so no `--force` is needed)
  ```bash
  cd /tmp/harb-worktree-<issue> && git commit --allow-empty -m "ci: retrigger" --no-verify && git push origin <branch>
  ```
- Restart woodpecker-agent: `sudo systemctl restart woodpecker-agent`
- View pipeline status: `wpdb -c "SELECT number, status FROM pipelines WHERE repo_id=2 ORDER BY number DESC LIMIT 5;"`
- View failed steps: `bash ${FACTORY_ROOT}/dev/ci-debug.sh failures <pipeline-number>`
- View step logs: `bash ${FACTORY_ROOT}/dev/ci-debug.sh logs <pipeline-number> <step-name>`

## Dangerous (escalate)
- Restarting woodpecker-server (drops all running pipelines)
- Modifying pipeline configs in `.woodpecker/` directory

## Known Issues
- Codeberg rate-limits SSH clones. `git` step fails with exit 128. Retrigger usually works.
- `log_entries` table grows fast (was 5.6GB once). Truncate periodically.
- Running CI + harb stack = 14+ containers on 8GB. Memory pressure is real.
- CI images take hours to rebuild. Never run `docker system prune -a`.

## Lessons Learned
- Exit code 128 on git step = Codeberg rate limit, not a code problem. Retrigger.
- Exit code 137 = SIGKILL (128 + 9), almost always an OOM kill. Check memory, kill stale processes, retrigger.
- `node-quality` step fails on ESLint/TypeScript errors; these need code fixes, not CI fixes.

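The triage rules above can be sketched as a small lookup. This is only an illustration: `classify_ci_failure` and its labels are hypothetical and do not exist in the factory scripts; the step names and exit codes are the ones listed in this file.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a failed step name and exit code to the action
# suggested by the lessons above. The labels are illustrative only.
classify_ci_failure() {
  local step="$1" code="$2"
  if [ "$code" -eq 137 ]; then
    echo "oom"          # 128 + 9 (SIGKILL): free memory, then retrigger
  elif [ "$step" = "git" ] && [ "$code" -eq 128 ]; then
    echo "rate-limit"   # Codeberg rate limit: retrigger
  elif [ "$step" = "node-quality" ]; then
    echo "code-fix"     # lint or type errors: needs a code change
  else
    echo "unknown"      # anything else: read the step logs first
  fi
}
```

For example, `classify_ci_failure git 128` prints `rate-limit`, which corresponds to the "Retrigger" advice above.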
37
factory/best-practices/dev-agent.md
Normal file
@@ -0,0 +1,37 @@
# Dev-Agent Best Practices

## Architecture
- `dev-poll.sh` (cron */10) → finds ready backlog issues → spawns `dev-agent.sh`
- `dev-agent.sh` uses `claude -p` for implementation, runs in git worktree
- Lock file: `/tmp/dev-agent.lock` (contains PID)
- Status file: `/tmp/dev-agent-status`
- Worktrees: `/tmp/harb-worktree-<issue-number>/`

## Safe Fixes
- Remove stale lock: `rm -f /tmp/dev-agent.lock` (only if PID is dead)
- Kill stuck agent: `kill <pid>`, then remove the lock
- Restart on derailed PR: `bash ${FACTORY_ROOT}/dev/dev-agent.sh <issue-number> &`
- Clean worktree: `cd /home/debian/harb && git worktree remove /tmp/harb-worktree-<N> --force`
- Remove `in-progress` label if agent died without cleanup:
  ```bash
  codeberg_api DELETE "/issues/<N>/labels/in-progress"
  ```

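The "only if PID is dead" condition on the stale-lock fix can be checked mechanically. A minimal sketch, assuming the lock file holds nothing but the PID (as stated under Architecture); `remove_stale_lock` is a hypothetical helper, not part of `dev-agent.sh`:

```bash
#!/usr/bin/env bash
# Hypothetical helper: remove the lock only when the PID inside it is gone.
# Prints "absent", "active", or "removed".
remove_stale_lock() {
  local lock="${1:-/tmp/dev-agent.lock}" pid
  [ -f "$lock" ] || { echo "absent"; return 0; }
  pid="$(cat "$lock")"
  # kill -0 delivers no signal; it only reports whether the process exists.
  # It also fails for processes owned by another user, so run this as the
  # same user that runs the agent.
  if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
    echo "active"
  else
    rm -f "$lock"
    echo "removed"
  fi
}
```

Calling `remove_stale_lock` with no argument checks the default lock path from the Architecture section and leaves the file alone while the agent is still running.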
## Dangerous (escalate)
- Restarting agent on an issue that has an open PR with review changes (may lose context)
- Anything that modifies the PR branch history
- Closing PRs or issues

## Known Issues
- `claude -p -c` (continue) fails if session was compacted; falls back to fresh `-p`
- CI_FIX_COUNT is now reset on CI pass (fixed 2026-03-12), so each review phase gets a fresh CI fix budget
- Worktree creation fails if main repo has a stale rebase; auto-heals now
- Large text passed via jq `--arg` can break (likely the argument-length limit); write it to a file first and load it from there (e.g. `--rawfile`, jq 1.6+)
- `$([ "$VAR" = true ] && echo "...")` returns non-zero when the test is false, which aborts the script under `set -euo pipefail` if it is used in a plain assignment

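The last issue above has a simple workaround: wrap the test in an `if`, which returns 0 when the condition is false. A minimal sketch; `optional_flag` and the `--draft` value are made up for illustration and are not taken from the factory scripts:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Failing pattern from the list above (left commented out on purpose):
#   flag="$([ "$VAR" = true ] && echo "--draft")"   # exits when VAR != true

# Hypothetical helper: print the flag only when the first argument is "true".
# An if statement with a false condition and no else branch returns 0,
# so the command substitution in the caller never trips set -e.
optional_flag() {
  local enabled="$1" flag="$2"
  if [ "$enabled" = true ]; then
    printf '%s\n' "$flag"
  fi
}

draft_flag="$(optional_flag false --draft)"   # empty string, script continues
echo "flag is: '${draft_flag}'"
```

Appending `|| true` inside the substitution has the same effect, at the cost of hiding real failures from the command on the left.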
## Lessons Learned
- Agents don't have memory between tasks, so full context must be in the prompt
- Prior art injection (closed PR diffs) prevents rework
- Feature issues MUST list affected e2e test files
- CI fix loop is essential; the first attempt rarely works
- CLAUDE_TIMEOUT=7200 (2h) is needed for complex issues

24
factory/best-practices/disk.md
Normal file
@@ -0,0 +1,24 @@
# Disk Best Practices

## Safe Fixes
- Docker cleanup: `sudo docker system prune -f` (keeps tagged images; removes stopped containers, unused networks, dangling images, and build cache)
- Truncate factory logs >5MB: `truncate -s 0 <file>`
- Remove stale worktrees: check `/tmp/harb-worktree-*`, only if dev-agent not running on them
- Woodpecker log_entries: `DELETE FROM log_entries WHERE id < (SELECT max(id) - 100000 FROM log_entries);` then `VACUUM FULL log_entries;` (plain `VACUUM` usually does not return the space to the OS)
- Node module caches in worktrees: `rm -rf /tmp/harb-worktree-*/node_modules/`
- Git garbage collection: `cd /home/debian/harb && git gc --prune=now`

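The log-truncation fix above can be applied to a whole directory at once. A minimal sketch, assuming the factory logs end in `.log`; the directory argument and `truncate_large_logs` are hypothetical, since this file does not say where the logs live:

```bash
#!/usr/bin/env bash
# Hypothetical helper: empty every *.log file larger than 5 MiB under a
# directory. Files are truncated in place, so open file handles stay valid.
truncate_large_logs() {
  local dir="$1"
  # -size +5M matches files whose size, rounded up to whole MiB, exceeds 5.
  find "$dir" -type f -name '*.log' -size +5M -exec truncate -s 0 {} +
}
```

Truncating instead of deleting matters for logs that a running process still holds open: the disk space is released immediately and the process keeps writing to the same file.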
## Dangerous (escalate)
- `docker system prune -a --volumes`: deletes all unused images and volumes, including CI images and build cache
- Deleting anything in `/home/debian/harb/` that's tracked by git
- Truncating Woodpecker DB tables other than log_entries

## Known Disk Hogs
- Woodpecker `log_entries` table: grows to 5GB+. Truncate periodically.
- Docker overlay layers: survive normal prune. The `-a` variant removes them, along with every unused image.
- Git worktrees in /tmp: accumulate node_modules, build artifacts
- Forge cache in `~/.foundry/cache/`: can grow large with many compilations

## Lessons Learned
- After truncating log_entries, run VACUUM FULL (reclaims actual disk space)
- Docker ghost overlay layers need `prune -a`, but that kills CI images; only do this if truly desperate

30
factory/best-practices/git.md
Normal file
@@ -0,0 +1,30 @@
# Git Best Practices

## Environment
- Repo: `/home/debian/harb`, remote: `codeberg.org/johba/harb`
- Branch: `master` (protected: no direct push, PRs only)
- Worktrees: `/tmp/harb-worktree-<issue>/`

## Safe Fixes
- Abort stale rebase: `cd /home/debian/harb && git rebase --abort`
- Switch to master: `git checkout master`
- Prune worktrees: `git worktree prune`
- Reset dirty state: `git checkout -- .` (discards uncommitted changes to tracked files; untracked files stay)
- Fetch latest: `git fetch origin master`

## Dangerous (escalate)
- `git reset --hard` on any branch with unpushed work
- Deleting remote branches
- Force-pushing to any branch
- Anything on the master branch directly

## Known Issues
- Main repo MUST be on master at all times. Dev work happens in worktrees.
- Stale rebases (detached HEAD) break all worktree creation, causing a silent factory stall.
- `git worktree add` fails if target directory exists (even empty). Remove first.
- Many old branches exist locally (100+). This is normal; don't bulk-delete.

## Lessons Learned
- NEVER delete remote branches before confirming merge. Close PR, rebase locally, force-push if needed.
- Stale rebase caused 5h factory stall once (2026-03-11). Auto-heal added to dev-agent.
- lint-staged hooks fail when `forge` not in PATH. Use `--no-verify` when committing from scripts.

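The stale-rebase stall above is easier to catch with an explicit check before creating a worktree. A minimal sketch, assuming the main repo is a normal clone whose `.git` is a directory; `rebase_in_progress` is a hypothetical helper and is not the auto-heal that was added to dev-agent:

```bash
#!/usr/bin/env bash
# Hypothetical helper: return 0 if the repository at $1 has an unfinished
# rebase. Git keeps rebase state in rebase-merge/ or rebase-apply/ inside
# the git directory until the rebase finishes or is aborted.
rebase_in_progress() {
  local repo="$1"
  [ -d "$repo/.git/rebase-merge" ] || [ -d "$repo/.git/rebase-apply" ]
}

# Usage, combined with the safe fix listed above:
#   if rebase_in_progress /home/debian/harb; then
#     git -C /home/debian/harb rebase --abort
#   fi
```

Checking the state directories avoids parsing `git status` output, which changes wording between git versions.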
29
factory/best-practices/memory.md
Normal file
@@ -0,0 +1,29 @@
# Memory Best Practices

## Environment
- VPS: 8GB RAM, 4GB swap, Debian
- Running: Docker stack (8 containers), Woodpecker CI, OpenClaw gateway

## Safe Fixes (no permission needed)
- Kill stale `claude` processes (>3h old): `pgrep -f "claude" --older 10800 | xargs -r kill` (`-r` skips `kill` when nothing matches)
- Drop filesystem caches: `sync && echo 3 | sudo tee /proc/sys/vm/drop_caches`
- Restart bloated Anvil: `sudo docker restart harb-anvil-1` (grows to 12GB+ over hours)
- Kill orphan node processes from dead worktrees

## Dangerous (escalate)
- `docker system prune -a --volumes`: kills CI images, hours to rebuild
- Stopping harb stack containers: breaks dev environment
- OOM that survives all safe fixes: needs human decision on what to kill

## Known Memory Hogs
- `claude` processes from dev-agent: 200MB+ each, can zombie
- `dockerd`: 600MB+ baseline (normal)
- `openclaw-gateway`: 500MB+ (normal)
- Anvil container: starts small, grows unbounded over hours
- `forge build` with via_ir: can spike to 4GB+. Use `--skip test script` to reduce.
- Vite dev servers inside containers: 150MB+ each

## Lessons Learned
- After killing processes, always `sync && echo 3 | sudo tee /proc/sys/vm/drop_caches`
- Swap doesn't drain from dropping caches alone, because it holds actual paged-out process memory
- Running CI + full harb stack = 14+ containers on 8GB. Only one pipeline at a time.

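The swap lesson above can be verified by reading swap usage before and after a fix. A minimal sketch that parses the standard `SwapTotal` and `SwapFree` fields of `/proc/meminfo`; `swap_used_kb` is a hypothetical helper, and the optional path argument exists only so the function can be tried against a sample file:

```bash
#!/usr/bin/env bash
# Hypothetical helper: print swap currently in use, in kB.
# Reads a meminfo-style file (default: /proc/meminfo).
swap_used_kb() {
  local meminfo="${1:-/proc/meminfo}"
  awk '/^SwapTotal:/ {total = $2}
       /^SwapFree:/  {free = $2}
       END           {print total - free}' "$meminfo"
}
```

If the number barely moves after dropping caches, that matches the lesson: the swapped pages belong to live processes, so only killing or restarting those processes frees them.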