feat: factory stall detection + Codeberg rate-limit best practices
- New P2 check: backlog exists + no agent ran in 20min = stalled - best-practices/codeberg.md: rate limiting awareness, retrigger cooldown - PROMPT.md: added codeberg best-practices reference
This commit is contained in:
parent
56cf332575
commit
04e80ee391
3 changed files with 61 additions and 0 deletions
|
|
@ -19,6 +19,7 @@ Before acting, read the relevant best-practices file:
|
|||
- Memory issues → `cat ${FACTORY_ROOT}/factory/best-practices/memory.md`
|
||||
- Disk issues → `cat ${FACTORY_ROOT}/factory/best-practices/disk.md`
|
||||
- CI issues → `cat ${FACTORY_ROOT}/factory/best-practices/ci.md`
|
||||
- Codeberg / rate limits → `cat ${FACTORY_ROOT}/factory/best-practices/codeberg.md`
|
||||
- Dev-agent issues → `cat ${FACTORY_ROOT}/factory/best-practices/dev-agent.md`
|
||||
- Review-agent issues → `cat ${FACTORY_ROOT}/factory/best-practices/review-agent.md`
|
||||
- Git issues → `cat ${FACTORY_ROOT}/factory/best-practices/git.md`
|
||||
|
|
|
|||
36
factory/best-practices/codeberg.md
Normal file
36
factory/best-practices/codeberg.md
Normal file
|
|
@ -0,0 +1,36 @@
|
|||
# Codeberg Best Practices
|
||||
|
||||
## Rate Limiting
|
||||
Codeberg rate-limits SSH and HTTPS clones. Symptoms:
|
||||
- Woodpecker `git` step fails with exit code 128
|
||||
- Multiple pipelines fail in quick succession with the same error
|
||||
- Retriggers make it WORSE by adding more clone attempts
|
||||
|
||||
### What To Do
|
||||
- **Do NOT retrigger** during a rate-limit storm. Wait 10-15 minutes.
|
||||
- Check if multiple pipelines failed on `git` step recently:
|
||||
```bash
|
||||
wpdb -c "SELECT number, status, to_timestamp(started) FROM pipelines WHERE repo_id=2 AND status='failure' ORDER BY number DESC LIMIT 5;"
|
||||
wpdb -c "SELECT s.name, s.exit_code FROM steps s JOIN pipelines p ON s.pipeline_id=p.id WHERE p.number=<N> AND p.repo_id=2 AND s.state='failure';"
|
||||
```
|
||||
- If multiple `git` failures with exit 128 in the last 15 min → it's rate limiting. Wait.
|
||||
- Only retrigger after 15+ minutes of no CI activity.
|
||||
|
||||
### How To Retrigger Safely
|
||||
```bash
|
||||
cd <worktree> && git commit --allow-empty -m "ci: retrigger" --no-verify && git push origin <branch> --force
|
||||
```
|
||||
|
||||
### Prevention
|
||||
- The factory runs 3 agents staggered by 3 minutes. During heavy development, many PRs trigger CI simultaneously.
|
||||
- One pipeline at a time is ideal on this VPS (resource + rate limit reasons).
|
||||
- If >3 pipelines are pending/running, do NOT create more work.
|
||||
|
||||
## OAuth Tokens
|
||||
- OAuth tokens expire ~2h. If Codeberg is down during refresh, re-login required.
|
||||
- API token is in `~/.netrc` — read via `awk` in env.sh.
|
||||
- Review bot has a separate token ($REVIEW_BOT_TOKEN) for formal reviews.
|
||||
|
||||
## Lessons Learned
|
||||
- Retrigger storm on 2026-03-12: supervisor + dev-agent both retriggered during rate limit, caused 5+ failed pipelines. Added cooldown awareness.
|
||||
- Empty commit retrigger works but adds noise to git history. Acceptable tradeoff.
|
||||
|
|
@ -192,6 +192,30 @@ if [ "$GIT_BRANCH" != "master" ] && [ "$GIT_BRANCH" != "unknown" ]; then
|
|||
p2 "Git: on '${GIT_BRANCH}' instead of master"
|
||||
fi
|
||||
|
||||
# =============================================================================
|
||||
# P2b: FACTORY STALLED — backlog exists but no agent running
|
||||
# =============================================================================
|
||||
status "P2: checking factory stall"
|
||||
|
||||
BACKLOG_COUNT=$(codeberg_api GET "/issues?state=open&labels=backlog&type=issues&limit=1" 2>/dev/null | jq -r 'length' 2>/dev/null || echo "0")
|
||||
IN_PROGRESS=$(codeberg_api GET "/issues?state=open&labels=in-progress&type=issues&limit=1" 2>/dev/null | jq -r 'length' 2>/dev/null || echo "0")
|
||||
|
||||
if [ "${BACKLOG_COUNT:-0}" -gt 0 ] && [ "${IN_PROGRESS:-0}" -eq 0 ]; then
|
||||
# Backlog exists but nothing in progress — check if dev-agent ran recently
|
||||
DEV_LOG="${FACTORY_ROOT}/dev/dev-agent.log"
|
||||
if [ -f "$DEV_LOG" ]; then
|
||||
LAST_LOG_EPOCH=$(stat -c %Y "$DEV_LOG" 2>/dev/null || echo 0)
|
||||
else
|
||||
LAST_LOG_EPOCH=0
|
||||
fi
|
||||
NOW_EPOCH=$(date +%s)
|
||||
IDLE_MIN=$(( (NOW_EPOCH - LAST_LOG_EPOCH) / 60 ))
|
||||
|
||||
if [ "$IDLE_MIN" -gt 20 ]; then
|
||||
p2 "Factory stalled: ${BACKLOG_COUNT} backlog issue(s), no agent ran for ${IDLE_MIN}min"
|
||||
fi
|
||||
fi
|
||||
|
||||
# =============================================================================
|
||||
# P3: FACTORY DEGRADED — derailed PRs, unreviewed PRs
|
||||
# =============================================================================
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue