fix: bug: supervisor hardcodes ops repo expectation — fails silently on deployments without one (#544)
Add OPS repo presence detection in supervisor-run.sh with degraded mode support:

- Detect if OPS_REPO_ROOT is missing and log WARNING message
- Set OPS_REPO_DEGRADED=1 flag and configure fallback paths
- Bundle minimal knowledge files as fallback for degraded mode
- Update formula to use OPS_KNOWLEDGE_ROOT, OPS_JOURNAL_ROOT, OPS_VAULT_ROOT
- Support local vault destination and journal fallback when ops repo absent

Knowledge files bundled: disk.md, memory.md, ci.md, git.md, dev-agent.md, review-agent.md, forge.md

The supervisor now runs with full functionality when ops repo is available, or gracefully degrades to local paths when absent, making the failure mode explicit rather than silent.
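
The detection described in this message could look roughly like the sketch below. It is not the code from supervisor-run.sh: the variable names come from the commit message, while the function name `resolve_ops_paths`, its fallback-root argument, and the `knowledge/`, `journal/`, `vault/` subdirectory layout are assumptions made for illustration.

```bash
# Hypothetical sketch of ops-repo detection with a degraded-mode fallback.
# Variable names follow the commit message; the directory layout is assumed.
resolve_ops_paths() {
  local fallback_root="$1"   # assumed: directory that holds the bundled files

  if [ -n "${OPS_REPO_ROOT:-}" ] && [ -d "$OPS_REPO_ROOT" ]; then
    # Ops repo present: full functionality.
    OPS_REPO_DEGRADED=0
    OPS_KNOWLEDGE_ROOT="$OPS_REPO_ROOT/knowledge"
    OPS_JOURNAL_ROOT="$OPS_REPO_ROOT/journal"
    OPS_VAULT_ROOT="$OPS_REPO_ROOT/vault"
  else
    # Ops repo absent: say so loudly instead of failing silently.
    echo "WARNING: ops repo not found at '${OPS_REPO_ROOT:-unset}', running in degraded mode" >&2
    OPS_REPO_DEGRADED=1
    OPS_KNOWLEDGE_ROOT="$fallback_root/knowledge"
    OPS_JOURNAL_ROOT="$fallback_root/journal"
    OPS_VAULT_ROOT="$fallback_root/vault"
  fi
  export OPS_REPO_DEGRADED OPS_KNOWLEDGE_ROOT OPS_JOURNAL_ROOT OPS_VAULT_ROOT
}
```

Callers would then read only the `OPS_*_ROOT` variables and never touch `OPS_REPO_ROOT` directly, which keeps the degraded path and the normal path identical from the formula's point of view.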
parent be5957f127
commit f299bae77b
11 changed files with 278 additions and 16 deletions
28
knowledge/ci.md
Normal file
@@ -0,0 +1,28 @@
# CI/CD — Best Practices

## CI Pipeline Issues (P2)

When CI pipelines are stuck running >20min or pending >30min:

### Investigation Steps
1. Check pipeline status via Forgejo API:
```bash
curl -sf -H "Authorization: token $FORGE_TOKEN" \
  "$FORGE_API/pipelines?limit=50" | jq '.[] | {number, status, created}'
```

2. Check Woodpecker CI if configured:
```bash
curl -sf -H "Authorization: Bearer $WOODPECKER_TOKEN" \
  "$WOODPECKER_SERVER/api/repos/${WOODPECKER_REPO_ID}/pipelines?limit=10"
```
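
The two thresholds above (running >20min, pending >30min) can be applied to each pipeline once its status and age are known. The helper below is a minimal sketch: `classify_pipeline` is a hypothetical name, and the status strings `running` and `pending` are assumptions about what the API returns.

```bash
# Hypothetical helper: classify one pipeline from its status and age in seconds.
# 1200 s = 20 min (stuck while running), 1800 s = 30 min (pending too long).
classify_pipeline() {
  local status="$1" age_seconds="$2"
  case "$status" in
    running) [ "$age_seconds" -gt 1200 ] && { echo "stuck"; return; } ;;
    pending) [ "$age_seconds" -gt 1800 ] && { echo "pending-too-long"; return; } ;;
  esac
  echo "ok"
}
```

For example, `classify_pipeline running 1500` prints `stuck`, while `classify_pipeline pending 1500` prints `ok`.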

### Common Fixes
- **Stuck pipeline**: Cancel via Forgejo API, retrigger
- **Pending pipeline**: Check queue depth, scale CI runners
- **Failed pipeline**: Review logs, fix failing test/step

### Prevention
- Set timeout limits on CI pipelines
- Monitor runner capacity and scale as needed
- Use caching for dependencies to reduce build time
28
knowledge/dev-agent.md
Normal file
@@ -0,0 +1,28 @@
# Dev Agent — Best Practices

## Dev Agent Issues (P2)

When dev-agent is stuck, blocked, or in bad state:

### Dead Lock File
```bash
# Remove the lock file if the PID it records is no longer running
ps -p "$(cat /path/to/lock.file 2>/dev/null)" >/dev/null 2>&1 || rm -f /path/to/lock.file
```

### Stale Worktree Cleanup
```bash
cd "$PROJECT_REPO_ROOT"
git worktree remove --force /tmp/stale-worktree 2>/dev/null || true
git worktree prune 2>/dev/null || true
```

### Blocked Pipeline
- Check if PR is awaiting review or CI
- Verify no other agent is actively working on same issue
- Check for unmet dependencies (issues with `Depends on` refs)

### Prevention
- Single-threaded pipeline per project (AD-002)
- Clear lock files in EXIT traps
- Use phase files to track agent state
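
Clearing lock files in an EXIT trap, as the Prevention list recommends, might be sketched as follows. The function names `acquire_lock` and `release_lock` are hypothetical, and the liveness check uses `kill -0`, which assumes all agents run as the same user. The check-then-write sequence is not atomic, so this is an illustration rather than a hardened lock.

```bash
# Hypothetical sketch: take a PID lock, reclaiming it if the owner is dead.
acquire_lock() {
  local lock_file="$1" pid
  if [ -f "$lock_file" ]; then
    pid="$(cat "$lock_file" 2>/dev/null)"
    # A live PID means another agent still holds the lock.
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
      return 1
    fi
    rm -f "$lock_file"   # dead lock: the recorded owner is gone
  fi
  echo "$$" > "$lock_file"
}

# Remove the lock; intended to be called from an EXIT trap.
release_lock() {
  rm -f "$1"
}
```

A script would typically call `acquire_lock "$LOCK_FILE" || exit 0` and then register `trap 'release_lock "$LOCK_FILE"' EXIT`, where `LOCK_FILE` is a placeholder for the agent's own lock path.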
35
knowledge/disk.md
Normal file
@@ -0,0 +1,35 @@
# Disk Management — Best Practices

## Disk Pressure Response (P1)

When disk usage exceeds 80%, take these actions in order:

### Immediate Actions
1. **Docker cleanup** (safe, low impact):
```bash
sudo docker system prune -f
```

2. **Aggressive Docker cleanup** (if still >80%):
```bash
sudo docker system prune -a -f
```
This also removes all unused images, not only dangling ones, on top of stopped containers, unused networks, and build cache. Volumes are not removed unless `--volumes` is passed.

3. **Log rotation**:
```bash
# Truncate any agent log larger than 10 MB (10240 KB)
for f in "$FACTORY_ROOT"/{dev,review,supervisor,gardener,planner,predictor}/*.log; do
  [ -f "$f" ] && [ "$(du -k "$f" | cut -f1)" -gt 10240 ] && truncate -s 0 "$f"
done
```

### Prevention
- Monitor disk with alerts at 70% (warning) and 80% (critical)
- Set up automatic log rotation for agent logs
- Clean up old Docker images regularly
- Consider using separate partitions for `/var/lib/docker`
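
The 70% and 80% alert levels above can be checked with plain `df`. This is a minimal sketch with hypothetical function names (`disk_usage_pct`, `disk_level`); it relies on the POSIX `df -P` output format, where the fifth column of the data line is the capacity percentage.

```bash
# Hypothetical helper: print the usage percentage of the filesystem holding $1.
disk_usage_pct() {
  # With -P each filesystem is one line; column 5 looks like "42%".
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: map a percentage to the alert levels listed above.
disk_level() {
  local pct="$1"
  if [ "$pct" -ge 80 ]; then
    echo "critical"
  elif [ "$pct" -ge 70 ]; then
    echo "warning"
  else
    echo "ok"
  fi
}
```

A monitoring loop could call `disk_level "$(disk_usage_pct /)"` and start the cleanup steps only when the result is `critical`.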

### When to Escalate
- Disk stays >80% after cleanup (indicates legitimate growth)
- No unused Docker images to clean
- Critical data filling disk (check /home, /var/log)
25
knowledge/forge.md
Normal file
@@ -0,0 +1,25 @@
# Forgejo Operations — Best Practices

## Forgejo Issues

When Forgejo operations encounter issues:

### API Rate Limits
- Monitor rate limit headers in API responses
- Implement exponential backoff on 429 responses
- Use agent-specific tokens (#747) to increase limits
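
Exponential backoff on 429 responses could be sketched as below. The names `backoff_delay`, `retry_on_429`, `MAX_ATTEMPTS`, and `BASE_DELAY` are hypothetical, and the wrapped command is assumed to print only the HTTP status code on stdout.

```bash
# Hypothetical helper: delay in seconds before the next retry.
# attempt 1 -> base, attempt 2 -> 2*base, attempt 3 -> 4*base, ...
backoff_delay() {
  local attempt="$1" base="${2:-1}"
  echo $(( base * (1 << (attempt - 1)) ))
}

# Hypothetical wrapper: rerun a command while it reports HTTP 429.
# Prints the final status; returns 1 if every attempt was rate limited.
retry_on_429() {
  local max_attempts="${MAX_ATTEMPTS:-5}" base="${BASE_DELAY:-1}" attempt=1 status
  while :; do
    status="$("$@")"
    if [ "$status" != "429" ]; then
      echo "$status"
      return 0
    fi
    [ "$attempt" -ge "$max_attempts" ] && { echo "$status"; return 1; }
    sleep "$(backoff_delay "$attempt" "$base")"
    attempt=$(( attempt + 1 ))
  done
}
```

With curl, the wrapped command can be something like `curl -s -o /dev/null -w '%{http_code}' -H "Authorization: token $FORGE_TOKEN" "$URL"`, where `URL` stands for whichever API endpoint is being called.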

### Authentication Issues
- Verify FORGE_TOKEN is valid and not expired
- Check agent identity matches token (#747)
- Use FORGE_<AGENT>_TOKEN for agent-specific identities

### Repository Access
- Verify FORGE_REMOTE matches actual git remote
- Check token has appropriate permissions (repo, write)
- Use `resolve_forge_remote()` to auto-detect remote

### Prevention
- Set up monitoring for API failures
- Rotate tokens before expiry
- Document required permissions per agent
28
knowledge/git.md
Normal file
@@ -0,0 +1,28 @@
# Git State Recovery — Best Practices

## Git State Issues (P2)

When git repo is on wrong branch or in broken rebase state:

### Wrong Branch Recovery
```bash
cd "$PROJECT_REPO_ROOT"
git checkout "$PRIMARY_BRANCH" 2>/dev/null || git checkout master 2>/dev/null
```

### Broken Rebase Recovery
```bash
cd "$PROJECT_REPO_ROOT"
git rebase --abort 2>/dev/null || true
git checkout "$PRIMARY_BRANCH" 2>/dev/null || git checkout master 2>/dev/null
```
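
Before aborting, it can help to confirm that a rebase really is in progress. Git keeps rebase state in a `rebase-merge` or `rebase-apply` directory inside the git directory (the latter is also used by `git am`), so a minimal check could look like this; `rebase_in_progress` is a hypothetical helper name.

```bash
# Hypothetical helper: succeed if the given git directory holds rebase state.
rebase_in_progress() {
  local git_dir="$1"
  [ -d "$git_dir/rebase-merge" ] || [ -d "$git_dir/rebase-apply" ]
}
```

One way to use it is `rebase_in_progress "$(git -C "$PROJECT_REPO_ROOT" rev-parse --absolute-git-dir)" && git -C "$PROJECT_REPO_ROOT" rebase --abort`, which avoids relying on the error suppression alone.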

### Stale Lock File Cleanup
```bash
rm -f /path/to/stale.lock
```

### Prevention
- Always checkout primary branch after rebase conflicts
- Remove lock files after agent sessions complete
- Use `git status` to verify repo state before operations
27
knowledge/memory.md
Normal file
@@ -0,0 +1,27 @@
# Memory Management — Best Practices

## Memory Crisis Response (P0)

When RAM available drops below 500MB or swap usage exceeds 3GB, take these actions:

### Immediate Actions
1. **Kill stale claude processes** (>3 hours old):
```bash
pgrep -f "claude -p" --older 10800 2>/dev/null | xargs kill 2>/dev/null || true
```

2. **Drop filesystem caches**:
```bash
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null 2>&1 || true
```
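
Whether the crisis thresholds above have been crossed can be read from `/proc/meminfo`. The sketch below uses hypothetical function names (`meminfo_mb`, `memory_crisis`) and assumes 3GB means 3072 MB; it accepts an alternative file path so it can be exercised without touching the live system.

```bash
# Hypothetical helper: print "<available_mb> <swap_used_mb>" from a
# /proc/meminfo-style file, where values are reported in kB.
meminfo_mb() {
  awk '/^MemAvailable:/ { avail = $2 }
       /^SwapTotal:/    { stotal = $2 }
       /^SwapFree:/     { sfree = $2 }
       END { printf "%d %d\n", avail / 1024, (stotal - sfree) / 1024 }' "${1:-/proc/meminfo}"
}

# Hypothetical helper: succeed when RAM < 500 MB or swap used > 3072 MB.
memory_crisis() {
  local avail_mb="$1" swap_used_mb="$2"
  [ "$avail_mb" -lt 500 ] || [ "$swap_used_mb" -gt 3072 ]
}
```

A caller can split the pair with `read -r avail swap <<< "$(meminfo_mb)"` and run the immediate actions only when `memory_crisis "$avail" "$swap"` succeeds.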

### Prevention
- Set memory_guard to 2000MB minimum (default in env.sh)
- Configure swap usage alerts at 2GB
- Monitor for memory leaks in long-running processes
- Use cgroups for process memory limits

### When to Escalate
- RAM stays <500MB after cache drop
- Swap continues growing after process kills
- System becomes unresponsive (OOM killer active)
23
knowledge/review-agent.md
Normal file
@@ -0,0 +1,23 @@
# Review Agent — Best Practices

## Review Agent Issues

When review agent encounters issues with PRs:

### Stale PR Handling
- PRs stale >20min (CI done, no push since) → file vault item for dev-agent
- Do NOT push branches or attempt merges directly
- File vault item with:
  - What: Stale PR requiring push
  - Why: Factory degraded
  - Unblocks: dev-agent will push the branch

### Circular Dependencies
- Check backlog for issues with circular `Depends on` refs
- Use `lib/parse-deps.sh` to analyze dependency graph
- Report to planner for resolution
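
As a rough cross-check for circular `Depends on` refs, a cycle in a list of dependency pairs can be detected with `tsort`. This sketch is not `lib/parse-deps.sh` and makes no claim about its interface: `has_circular_deps` is a hypothetical helper, the input format (one `issue dependency` pair per line) is an assumption, and it relies on GNU coreutils `tsort` exiting non-zero when the input contains a loop.

```bash
# Hypothetical helper: read "issue dependency" pairs on stdin and succeed
# if the graph contains a cycle. tsort also fails on an odd number of
# tokens, so malformed input is reported as a problem too.
has_circular_deps() {
  ! tsort >/dev/null 2>&1
}
```

For example, `printf '1 2\n2 3\n3 1\n' | has_circular_deps` succeeds because issue 1 depends on 2, 2 on 3, and 3 back on 1.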

### Prevention
- Review agent only reads PRs, never modifies
- Use vault items for actions requiring dev-agent
- Monitor for PRs stuck in review state