fix: bug: supervisor hardcodes ops repo expectation — fails silently on deployments without one (#544)

Add OPS repo presence detection in supervisor-run.sh with degraded mode support:
- Detect if OPS_REPO_ROOT is missing and log WARNING message
- Set OPS_REPO_DEGRADED=1 flag and configure fallback paths
- Bundle minimal knowledge files as fallback for degraded mode
- Update formula to use OPS_KNOWLEDGE_ROOT, OPS_JOURNAL_ROOT, OPS_VAULT_ROOT
- Support local vault destination and journal fallback when ops repo absent

Knowledge files bundled: disk.md, memory.md, ci.md, git.md, dev-agent.md, review-agent.md, forge.md

The supervisor now runs with full functionality when ops repo is available, or gracefully degrades to local paths when absent, making the failure mode explicit rather than silent.
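
The detection and fallback described in this message could look roughly like the sketch below. The variable names come from the commit message. The function name `detect_ops_repo`, the subdirectory names inside the ops repo, and the local fallback locations under `FACTORY_ROOT` are assumptions made for illustration, not the contents of supervisor-run.sh.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of ops repo detection with a degraded-mode fallback.
# Directory layouts below are assumptions, not taken from supervisor-run.sh.
# The caller is expected to have set FACTORY_ROOT.

detect_ops_repo() {
  if [ -n "${OPS_REPO_ROOT:-}" ] && [ -d "$OPS_REPO_ROOT" ]; then
    # Ops repo present: full functionality.
    OPS_REPO_DEGRADED=0
    OPS_KNOWLEDGE_ROOT="$OPS_REPO_ROOT/knowledge"   # assumed layout
    OPS_JOURNAL_ROOT="$OPS_REPO_ROOT/journal"       # assumed layout
    OPS_VAULT_ROOT="$OPS_REPO_ROOT/vault"           # assumed layout
  else
    # Ops repo absent: say so explicitly, then fall back to local paths.
    echo "WARNING: ops repo not found at '${OPS_REPO_ROOT:-<unset>}', running in degraded mode" >&2
    OPS_REPO_DEGRADED=1
    OPS_KNOWLEDGE_ROOT="$FACTORY_ROOT/knowledge"    # bundled knowledge files
    OPS_JOURNAL_ROOT="$FACTORY_ROOT/state/journal"  # assumed local fallback
    OPS_VAULT_ROOT="$FACTORY_ROOT/state/vault"      # assumed local fallback
  fi
  export OPS_REPO_DEGRADED OPS_KNOWLEDGE_ROOT OPS_JOURNAL_ROOT OPS_VAULT_ROOT
}
```

Writing the warning to stderr and exporting `OPS_REPO_DEGRADED` keeps the degraded state visible both in logs and to any child process that needs to branch on it.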
2026-04-10 08:16:03 +00:00
commit f299bae77b
parent be5957f127
11 changed files with 278 additions and 16 deletions
--- a/knowledge/ci.md
+++ b/knowledge/ci.md
@@ -0,0 +1,28 @@
+# CI/CD — Best Practices
+
+## CI Pipeline Issues (P2)
+
+When CI pipelines are stuck running >20min or pending >30min:
+
+### Investigation Steps
+1. Check pipeline status via Forgejo API:
+   ```bash
+   curl -sf -H "Authorization: token $FORGE_TOKEN" \
+     "$FORGE_API/pipelines?limit=50" | jq '.[] | {number, status, created}'
+   ```
+
+2. Check Woodpecker CI if configured:
+   ```bash
+   curl -sf -H "Authorization: Bearer $WOODPECKER_TOKEN" \
+     "$WOODPECKER_SERVER/api/repos/${WOODPECKER_REPO_ID}/pipelines?limit=10"
+   ```
+
+### Common Fixes
+- **Stuck pipeline**: Cancel via Forgejo API, retrigger
+- **Pending pipeline**: Check queue depth, scale CI runners
+- **Failed pipeline**: Review logs, fix failing test/step
+
+### Prevention
+- Set timeout limits on CI pipelines
+- Monitor runner capacity and scale as needed
+- Use caching for dependencies to reduce build time
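
The thresholds in ci.md (running >20min, pending >30min) can be expressed as a small predicate. This is only a sketch: the function name `pipeline_is_stuck` is hypothetical, and it assumes the caller has already extracted a status string and an age in seconds from the API response.

```bash
#!/usr/bin/env bash
# Hypothetical helper: decide whether a pipeline counts as stuck.
# Thresholds follow ci.md: running >20min or pending >30min.

pipeline_is_stuck() {
  local status=$1 age_seconds=$2
  case "$status" in
    running) [ "$age_seconds" -gt $((20 * 60)) ] ;;
    pending) [ "$age_seconds" -gt $((30 * 60)) ] ;;
    *)       return 1 ;;   # any other status is never "stuck"
  esac
}

# Usage (assumes the creation time is available as epoch seconds):
#   now=$(date +%s)
#   pipeline_is_stuck running $((now - created)) && echo "stuck"
```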
--- a/knowledge/dev-agent.md
+++ b/knowledge/dev-agent.md
@@ -0,0 +1,28 @@
+# Dev Agent — Best Practices
+
+## Dev Agent Issues (P2)
+
+When the dev-agent is stuck, blocked, or in a bad state:
+
+### Dead Lock File
+```bash
+# Remove the lock only if the PID recorded in it no longer exists
+ps -p "$(cat /path/to/lock.file)" >/dev/null 2>&1 || rm -f /path/to/lock.file
+```
+
+### Stale Worktree Cleanup
+```bash
+cd "$PROJECT_REPO_ROOT"
+git worktree remove --force /tmp/stale-worktree 2>/dev/null || true
+git worktree prune 2>/dev/null || true
+```
+
+### Blocked Pipeline
+- Check if PR is awaiting review or CI
+- Verify no other agent is actively working on same issue
+- Check for unmet dependencies (issues with `Depends on` refs)
+
+### Prevention
+- Single-threaded pipeline per project (AD-002)
+- Clear lock files in EXIT traps
+- Use phase files to track agent state
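
The "Dead Lock File" check and the "clear lock files in EXIT traps" advice in dev-agent.md can be combined into one pattern, sketched below. The function names are hypothetical, and the sketch assumes the lock file contains nothing but a PID.

```bash
#!/usr/bin/env bash
# Hypothetical lock helpers; the lock file is assumed to hold only a PID.

lock_is_stale() {
  local lock_file=$1 pid
  [ -f "$lock_file" ] || return 1          # no lock file, nothing stale
  pid=$(cat "$lock_file" 2>/dev/null)
  case "$pid" in
    ''|*[!0-9]*) return 0 ;;               # empty or garbage: treat as stale
  esac
  # kill -0 sends no signal; it only checks that the PID can be signalled.
  # Caveat: it also fails for live processes owned by another user.
  ! kill -0 "$pid" 2>/dev/null
}

acquire_lock() {
  local lock_file=$1
  if lock_is_stale "$lock_file"; then
    rm -f "$lock_file"
  fi
  # noclobber makes the redirect fail if the file already exists.
  ( set -o noclobber; echo "$$" > "$lock_file" ) 2>/dev/null || return 1
  # Remove the lock when the shell exits. The path is expanded now because
  # the local variable is gone by the time the trap runs.
  # Note: this replaces any EXIT trap that was already set.
  trap "rm -f -- $(printf '%q' "$lock_file")" EXIT
}
```

A caller would typically run `acquire_lock /path/to/lock.file || exit 0` at the top of the agent script, so a second instance exits quietly while the first is still alive.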
--- a/knowledge/disk.md
+++ b/knowledge/disk.md
@@ -0,0 +1,35 @@
+# Disk Management — Best Practices
+
+## Disk Pressure Response (P1)
+
+When disk usage exceeds 80%, take these actions in order:
+
+### Immediate Actions
+1. **Docker cleanup** (safe, low impact):
+   ```bash
+   sudo docker system prune -f
+   ```
+
+2. **Aggressive Docker cleanup** (if still >80%):
+   ```bash
+   sudo docker system prune -a -f
+   ```
+   This also removes all unused images, not only dangling ones. Volumes are kept unless `--volumes` is passed.
+
+3. **Log rotation** (truncates agent logs larger than 10 MB):
+   ```bash
+   for f in "$FACTORY_ROOT"/{dev,review,supervisor,gardener,planner,predictor}/*.log; do
+     [ -f "$f" ] && [ "$(du -k "$f" | cut -f1)" -gt 10240 ] && truncate -s 0 "$f"
+   done
+   ```
+
+### Prevention
+- Monitor disk with alerts at 70% (warning) and 80% (critical)
+- Set up automatic log rotation for agent logs
+- Clean up old Docker images regularly
+- Consider using separate partitions for `/var/lib/docker`
+
+### When to Escalate
+- Disk stays >80% after cleanup (indicates legitimate growth)
+- No unused Docker images to clean
+- Critical data filling disk (check /home, /var/log)
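
The alert levels in disk.md (70% warning, 80% critical) map naturally onto a small classifier. The sketch below uses hypothetical function names and treats the thresholds as inclusive, which is an assumption; the file itself only says alerts fire "at" those values.

```bash
#!/usr/bin/env bash
# Hypothetical helpers mapping disk usage to the alert levels in disk.md.

disk_usage_pct() {
  # df -P gives stable POSIX output; column 5 is "Capacity", e.g. "42%".
  df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

disk_level() {
  local pct=$1
  if   [ "$pct" -ge 80 ]; then echo critical
  elif [ "$pct" -ge 70 ]; then echo warning
  else                         echo ok
  fi
}

# Usage:
#   disk_level "$(disk_usage_pct /)"
```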
--- a/knowledge/forge.md
+++ b/knowledge/forge.md
@@ -0,0 +1,25 @@
+# Forgejo Operations — Best Practices
+
+## Forgejo Issues
+
+When Forgejo operations encounter issues:
+
+### API Rate Limits
+- Monitor rate limit headers in API responses
+- Implement exponential backoff on 429 responses
+- Use agent-specific tokens (#747) to increase limits
+
+### Authentication Issues
+- Verify FORGE_TOKEN is valid and not expired
+- Check agent identity matches token (#747)
+- Use FORGE_<AGENT>_TOKEN for agent-specific identities
+
+### Repository Access
+- Verify FORGE_REMOTE matches actual git remote
+- Check token has appropriate permissions (repo, write)
+- Use `resolve_forge_remote()` to auto-detect remote
+
+### Prevention
+- Set up monitoring for API failures
+- Rotate tokens before expiry
+- Document required permissions per agent
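
The "exponential backoff on 429 responses" advice in forge.md can be sketched as a generic retry wrapper. The function name `retry_on_429` is hypothetical. The wrapped command is assumed to print an HTTP status code on stdout, which `curl -w '%{http_code}'` does.

```bash
#!/usr/bin/env bash
# Hypothetical retry wrapper with exponential backoff on HTTP 429.
# Usage: retry_on_429 MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND [ARGS...]
# COMMAND must print an HTTP status code on stdout.

retry_on_429() {
  local max_attempts=$1 base_delay=$2
  shift 2
  local attempt=1 delay=$base_delay status
  while :; do
    status=$("$@") || status=000           # 000: the command itself failed
    if [ "$status" != "429" ]; then
      printf '%s\n' "$status"
      # Succeed only on a 2xx status.
      [ "$status" -ge 200 ] && [ "$status" -lt 300 ]
      return
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      printf '%s\n' "$status"              # still rate limited, give up
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))                   # double the wait each round
    attempt=$((attempt + 1))
  done
}

# Usage (endpoint reused from ci.md):
#   retry_on_429 5 1 curl -s -o /dev/null -w '%{http_code}' \
#     -H "Authorization: token $FORGE_TOKEN" "$FORGE_API/pipelines?limit=50"
```

Taking the command as arguments keeps the backoff logic independent of curl, so it can be exercised with a stub command.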
--- a/knowledge/git.md
+++ b/knowledge/git.md
@@ -0,0 +1,28 @@
+# Git State Recovery — Best Practices
+
+## Git State Issues (P2)
+
+When the git repo is on the wrong branch or in a broken rebase state:
+
+### Wrong Branch Recovery
+```bash
+cd "$PROJECT_REPO_ROOT"
+git checkout "$PRIMARY_BRANCH" 2>/dev/null || git checkout master 2>/dev/null
+```
+
+### Broken Rebase Recovery
+```bash
+cd "$PROJECT_REPO_ROOT"
+git rebase --abort 2>/dev/null || true
+git checkout "$PRIMARY_BRANCH" 2>/dev/null || git checkout master 2>/dev/null
+```
+
+### Stale Lock File Cleanup
+```bash
+rm -f /path/to/stale.lock
+```
+
+### Prevention
+- Always checkout primary branch after rebase conflicts
+- Remove lock files after agent sessions complete
+- Use `git status` to verify repo state before operations
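
Before running the "Broken Rebase Recovery" commands in git.md, it can help to check whether a rebase is actually in progress. The sketch below is one way to do that; the function name `rebase_in_progress` is hypothetical. It relies on git's own state directories, `rebase-merge` and `rebase-apply`, the latter of which is also used by `git am`.

```bash
#!/usr/bin/env bash
# Hypothetical check for an interrupted rebase (or `git am`) in a repo.

rebase_in_progress() {
  local repo=$1
  (
    cd "$repo" 2>/dev/null || exit 2
    # --git-path resolves the state directories correctly even in worktrees.
    [ -d "$(git rev-parse --git-path rebase-merge 2>/dev/null)" ] ||
      [ -d "$(git rev-parse --git-path rebase-apply 2>/dev/null)" ]
  )
}

# Usage:
#   if rebase_in_progress "$PROJECT_REPO_ROOT"; then
#     git -C "$PROJECT_REPO_ROOT" rebase --abort
#   fi
```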
--- a/knowledge/memory.md
+++ b/knowledge/memory.md
@@ -0,0 +1,27 @@
+# Memory Management — Best Practices
+
+## Memory Crisis Response (P0)
+
+When available RAM drops below 500MB or swap usage exceeds 3GB, take these actions:
+
+### Immediate Actions
+1. **Kill stale claude processes** (>3 hours old):
+   ```bash
+   pgrep -f "claude -p" --older 10800 2>/dev/null | xargs kill 2>/dev/null || true
+   ```
+
+2. **Drop filesystem caches**:
+   ```bash
+   sync && echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null 2>&1 || true
+   ```
+
+### Prevention
+- Set memory_guard to 2000MB minimum (default in env.sh)
+- Configure swap usage alerts at 2GB
+- Monitor for memory leaks in long-running processes
+- Use cgroups for process memory limits
+
+### When to Escalate
+- RAM stays <500MB after cache drop
+- Swap continues growing after process kills
+- System becomes unresponsive (OOM killer active)
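
The crisis condition in memory.md (available RAM below 500MB or swap usage above 3GB) can be read from `/proc/meminfo`. The sketch below uses a hypothetical function name, is Linux-specific, and treats MB and GB as binary units (MiB and GiB), which is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical check for the memory crisis condition in memory.md.
# Reads /proc/meminfo by default; values there are in kB.

memory_crisis() {
  local meminfo=${1:-/proc/meminfo}
  awk '
    /^MemAvailable:/ { avail = $2 }
    /^SwapTotal:/    { swap_total = $2 }
    /^SwapFree:/     { swap_free = $2 }
    END {
      swap_used = swap_total - swap_free
      # Exit 0 (shell "true") when either threshold is crossed.
      exit !(avail < 500 * 1024 || swap_used > 3 * 1024 * 1024)
    }
  ' "$meminfo"
}

# Usage:
#   memory_crisis && echo "P0: memory crisis, run immediate actions"
```

Accepting the meminfo path as an argument lets the thresholds be checked against a captured snapshot as well as the live system.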
--- a/knowledge/review-agent.md
+++ b/knowledge/review-agent.md
@@ -0,0 +1,23 @@
+# Review Agent — Best Practices
+
+## Review Agent Issues
+
+When the review agent encounters issues with PRs:
+
+### Stale PR Handling
+- PRs stale >20min (CI done, no push since) → file vault item for dev-agent
+- Do NOT push branches or attempt merges directly
+- File vault item with:
+  - What: Stale PR requiring push
+  - Why: Factory degraded
+  - Unblocks: dev-agent will push the branch
+
+### Circular Dependencies
+- Check backlog for issues with circular `Depends on` refs
+- Use `lib/parse-deps.sh` to analyze dependency graph
+- Report to planner for resolution
+
+### Prevention
+- Review agent only reads PRs, never modifies
+- Use vault items for actions requiring dev-agent
+- Monitor for PRs stuck in review state
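
For the circular `Depends on` check in review-agent.md, the standard `tsort` utility gives a quick independent test. This sketch is not `lib/parse-deps.sh`, whose interface is not shown here; the function name `deps_have_cycle` and the input format (one "issue dependency" pair per line) are assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical cycle check using tsort (topological sort).
# stdin: one "issue dependency" pair per line, e.g. "544 512".

deps_have_cycle() {
  # tsort exits non-zero when the input contains a loop.
  # Caveat: it also fails on malformed input (odd number of tokens).
  ! tsort >/dev/null 2>&1
}

# Usage:
#   printf '%s\n' "544 512" "512 544" | deps_have_cycle && echo "cycle found"
```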