feat: factory stall detection + Codeberg rate-limit best practices

- New P2 check: backlog exists + no agent ran in 20min = stalled
- best-practices/codeberg.md: rate limiting awareness, retrigger cooldown
- PROMPT.md: added codeberg best-practices reference
openhands 2026-03-12 18:06:08 +00:00
parent 56cf332575
commit 04e80ee391
3 changed files with 61 additions and 0 deletions


@@ -19,6 +19,7 @@ Before acting, read the relevant best-practices file:
- Memory issues → `cat ${FACTORY_ROOT}/factory/best-practices/memory.md`
- Disk issues → `cat ${FACTORY_ROOT}/factory/best-practices/disk.md`
- CI issues → `cat ${FACTORY_ROOT}/factory/best-practices/ci.md`
- Codeberg / rate limits → `cat ${FACTORY_ROOT}/factory/best-practices/codeberg.md`
- Dev-agent issues → `cat ${FACTORY_ROOT}/factory/best-practices/dev-agent.md`
- Review-agent issues → `cat ${FACTORY_ROOT}/factory/best-practices/review-agent.md`
- Git issues → `cat ${FACTORY_ROOT}/factory/best-practices/git.md`


@@ -0,0 +1,36 @@
# Codeberg Best Practices

## Rate Limiting
Codeberg rate-limits SSH and HTTPS clones. Symptoms:
- Woodpecker `git` step fails with exit code 128
- Multiple pipelines fail in quick succession with the same error
- Retriggers make it WORSE by adding more clone attempts

### What To Do
- **Do NOT retrigger** during a rate-limit storm. Wait at least 15 minutes.
- Check if multiple pipelines failed on `git` step recently:
```bash
wpdb -c "SELECT number, status, to_timestamp(started) FROM pipelines WHERE repo_id=2 AND status='failure' ORDER BY number DESC LIMIT 5;"
wpdb -c "SELECT s.name, s.exit_code FROM steps s JOIN pipelines p ON s.pipeline_id=p.id WHERE p.number=<N> AND p.repo_id=2 AND s.state='failure';"
```
- If multiple `git` failures with exit 128 in the last 15 min → it's rate limiting. Wait.
- Only retrigger after 15+ minutes of no CI activity.
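
The decision rule above can be sketched as a small shell helper. This is a minimal sketch, not part of the factory scripts: `is_rate_limit_storm` is a hypothetical name, the caller is assumed to have already extracted the start times (epoch seconds) of pipelines whose `git` step exited with 128, and "multiple" is read as two or more.

```bash
#!/usr/bin/env bash
# Hypothetical helper: decide whether recent `git`-step failures look like a
# rate-limit storm.
# Arguments: NOW (epoch seconds), then the epoch start times of pipelines
# whose `git` step exited with code 128.
is_rate_limit_storm() {
  local now=$1; shift
  local window=$((15 * 60))   # 15-minute window, per the guidance above
  local recent=0 ts
  for ts in "$@"; do
    if [ $((now - ts)) -le "$window" ]; then
      recent=$((recent + 1))
    fi
  done
  # Assumption: "multiple" means at least two failures inside the window.
  [ "$recent" -ge 2 ]
}
```

Usage might look like `is_rate_limit_storm "$(date +%s)" "${FAILED_STARTS[@]}" && echo "rate limited, waiting"`, where `FAILED_STARTS` is an illustrative array filled from the queries above.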

### How To Retrigger Safely
```bash
# --force-with-lease refuses to overwrite remote commits you have not fetched
cd <worktree> && git commit --allow-empty -m "ci: retrigger" --no-verify && git push --force-with-lease origin <branch>
```
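
To make the 15-minute rule hard to skip, the retrigger could be gated behind a cooldown check. The sketch below is illustrative only: `cooldown_elapsed` and the `COOLDOWN_MIN` variable are hypothetical names, and the caller is assumed to know the epoch time of the last CI activity.

```bash
#!/usr/bin/env bash
# Hypothetical guard: succeed only when the last CI activity is at least
# COOLDOWN_MIN minutes old (default 15).
# Arguments: LAST_ACTIVITY (epoch seconds), optional NOW (epoch seconds).
cooldown_elapsed() {
  local last_activity=$1
  local now=${2:-$(date +%s)}
  local cooldown_min=${COOLDOWN_MIN:-15}
  [ $(( (now - last_activity) / 60 )) -ge "$cooldown_min" ]
}
```

A caller could then write `cooldown_elapsed "$LAST_CI_EPOCH" || { echo "cooling down, not retriggering"; exit 0; }` before running the retrigger command, with `LAST_CI_EPOCH` supplied by whatever tracks CI activity.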

### Prevention
- The factory runs 3 agents staggered by 3 minutes. During heavy development, many PRs trigger CI simultaneously.
- One pipeline at a time is ideal on this VPS (resource + rate limit reasons).
- If >3 pipelines are pending/running, do NOT create more work.
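
The last rule can be expressed as a backpressure check. This is a rough sketch under stated assumptions: `count_active` and `may_create_work` are hypothetical helpers, they read one pipeline status per line on stdin, and the status strings `pending` and `running` are assumed to match what the CI database reports.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: read one pipeline status per line on stdin.

# Print how many statuses are pending or running.
count_active() {
  # grep -c prints 0 but exits non-zero on no match; keep the exit status clean.
  grep -c -E '^(pending|running)$' || true
}

# Succeed when at most 3 pipelines are pending/running.
may_create_work() {
  [ "$(count_active)" -le 3 ]
}
```

For example, `printf '%s\n' "${STATUSES[@]}" | may_create_work || echo "CI saturated, skipping"`, where `STATUSES` is an illustrative array of recent pipeline statuses.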

## OAuth Tokens
- OAuth tokens expire after ~2h. If Codeberg is down during a refresh, a re-login is required.
- The API token lives in `~/.netrc`; `env.sh` reads it via `awk`.
- Review bot has a separate token ($REVIEW_BOT_TOKEN) for formal reviews.
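
For orientation, reading a token out of a netrc-style file with `awk` might look like the sketch below. It is not the actual code in `env.sh`: `netrc_token` is a hypothetical function, and it assumes each entry is on one line in the form `machine <host> login <user> password <token>`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the password field for a host from a
# netrc-style file.
# Assumes one entry per line: machine <host> login <user> password <token>
# Arguments: HOST, optional FILE (defaults to ~/.netrc).
netrc_token() {
  local host=$1
  local file=${2:-$HOME/.netrc}
  awk -v host="$host" '
    $1 == "machine" && $2 == host {
      for (i = 3; i < NF; i++)
        if ($i == "password") { print $(i + 1); exit }
    }' "$file"
}
```

A caller might use it as `TOKEN=$(netrc_token codeberg.org)`; an empty result means no matching entry was found.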

## Lessons Learned
- Retrigger storm on 2026-03-12: supervisor + dev-agent both retriggered during rate limit, caused 5+ failed pipelines. Added cooldown awareness.
- Empty commit retrigger works but adds noise to git history. Acceptable tradeoff.


@@ -192,6 +192,30 @@ if [ "$GIT_BRANCH" != "master" ] && [ "$GIT_BRANCH" != "unknown" ]; then
  p2 "Git: on '${GIT_BRANCH}' instead of master"
fi

# =============================================================================
# P2b: FACTORY STALLED — backlog exists but no agent running
# =============================================================================
status "P2: checking factory stall"

# limit=1 makes these existence checks: each count is 0 or 1, not a true total.
BACKLOG_COUNT=$(codeberg_api GET "/issues?state=open&labels=backlog&type=issues&limit=1" 2>/dev/null | jq -r 'length' 2>/dev/null || echo "0")
IN_PROGRESS=$(codeberg_api GET "/issues?state=open&labels=in-progress&type=issues&limit=1" 2>/dev/null | jq -r 'length' 2>/dev/null || echo "0")

if [ "${BACKLOG_COUNT:-0}" -gt 0 ] && [ "${IN_PROGRESS:-0}" -eq 0 ]; then
  # Backlog exists but nothing in progress — check if dev-agent ran recently
  DEV_LOG="${FACTORY_ROOT}/dev/dev-agent.log"
  if [ -f "$DEV_LOG" ]; then
    # Log mtime (GNU stat) is the proxy for the last dev-agent run
    LAST_LOG_EPOCH=$(stat -c %Y "$DEV_LOG" 2>/dev/null || echo 0)
  else
    LAST_LOG_EPOCH=0
  fi
  NOW_EPOCH=$(date +%s)
  IDLE_MIN=$(( (NOW_EPOCH - LAST_LOG_EPOCH) / 60 ))

  if [ "$IDLE_MIN" -gt 20 ]; then
    p2 "Factory stalled: ${BACKLOG_COUNT} backlog issue(s), no agent ran for ${IDLE_MIN}min"
  fi
fi

# =============================================================================
# P3: FACTORY DEGRADED — derailed PRs, unreviewed PRs
# =============================================================================