feat: factory stall detection + Codeberg rate-limit best practices

- New P2 check: backlog exists + no agent ran in 20min = stalled
- best-practices/codeberg.md: rate limiting awareness, retrigger cooldown
- PROMPT.md: added codeberg best-practices reference
parent 56cf332575
commit 04e80ee391
3 changed files with 61 additions and 0 deletions
@@ -19,6 +19,7 @@ Before acting, read the relevant best-practices file:
- Memory issues → `cat ${FACTORY_ROOT}/factory/best-practices/memory.md`
- Disk issues → `cat ${FACTORY_ROOT}/factory/best-practices/disk.md`
- CI issues → `cat ${FACTORY_ROOT}/factory/best-practices/ci.md`
- Codeberg / rate limits → `cat ${FACTORY_ROOT}/factory/best-practices/codeberg.md`
- Dev-agent issues → `cat ${FACTORY_ROOT}/factory/best-practices/dev-agent.md`
- Review-agent issues → `cat ${FACTORY_ROOT}/factory/best-practices/review-agent.md`
- Git issues → `cat ${FACTORY_ROOT}/factory/best-practices/git.md`

36
factory/best-practices/codeberg.md
Normal file
@@ -0,0 +1,36 @@
# Codeberg Best Practices

## Rate Limiting
Codeberg rate-limits SSH and HTTPS clones. Symptoms:
- Woodpecker `git` step fails with exit code 128
- Multiple pipelines fail in quick succession with the same error
- Retriggers make it WORSE by adding more clone attempts

### What To Do
- **Do NOT retrigger** during a rate-limit storm. Wait 10-15 minutes.
- Check if multiple pipelines failed on `git` step recently:
```bash
wpdb -c "SELECT number, status, to_timestamp(started) FROM pipelines WHERE repo_id=2 AND status='failure' ORDER BY number DESC LIMIT 5;"
wpdb -c "SELECT s.name, s.exit_code FROM steps s JOIN pipelines p ON s.pipeline_id=p.id WHERE p.number=<N> AND p.repo_id=2 AND s.state='failure';"
```
- If multiple `git` failures with exit 128 in the last 15 min → it's rate limiting. Wait.
- Only retrigger after 15+ minutes of no CI activity.
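
The rule of thumb above (several `git` step failures with exit 128 inside 15 minutes means rate limiting) can be sketched as a small filter. This is only an illustration: the function name `is_rate_limit_storm`, the pipe-separated `name|exit_code|epoch` row format, and the default threshold of two failures are assumptions, not part of the factory scripts. The rows would have to come from a query of your own that also selects a timestamp for each failed step.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads rows of "step_name|exit_code|epoch" on stdin.
# Succeeds (exit 0) when at least MIN_FAILS `git` steps exited 128
# within the last WINDOW_MIN minutes relative to NOW (epoch seconds).
is_rate_limit_storm() {
  local now="$1" window_min="${2:-15}" min_fails="${3:-2}"
  awk -F'|' -v now="$now" -v win="$((window_min * 60))" -v need="$min_fails" '
    $1 == "git" && $2 == 128 && (now - $3) <= win { hits++ }
    END { exit !(hits + 0 >= need + 0) }
  '
}

# Example with made-up rows: two recent git/128 failures, so the check succeeds.
printf '%s\n' 'git|128|1000' 'git|128|1300' 'build|1|1350' |
  is_rate_limit_storm 1500 && echo "looks like rate limiting: wait"
```

Keeping the decision in a pure filter like this means it can be exercised with canned rows, without touching the Woodpecker database.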

### How To Retrigger Safely
```bash
cd <worktree> && git commit --allow-empty -m "ci: retrigger" --no-verify && git push origin <branch> --force
```
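
To make the cooldown harder to skip, the retrigger command could be wrapped in a guard that remembers when it last ran. The following is a minimal sketch under stated assumptions: the function name `cooldown_elapsed` and the stamp-file path are hypothetical, and `stat -c %Y` assumes GNU coreutils. It only tracks the caller's own last retrigger, not overall CI activity, so it complements the database check rather than replacing it.

```shell
#!/usr/bin/env bash
# Hypothetical guard: succeeds only if STAMP is missing or was last
# modified at least COOLDOWN_MIN minutes ago.
cooldown_elapsed() {
  local stamp="$1" cooldown_min="${2:-15}" last now
  [ -f "$stamp" ] || return 0              # no earlier retrigger recorded
  last=$(stat -c %Y "$stamp") || return 1  # mtime of the stamp, epoch seconds
  now=$(date +%s)
  [ $(( (now - last) / 60 )) -ge "$cooldown_min" ]
}

# Example wiring (the stamp path is an assumption):
# STAMP=/tmp/ci-retrigger.stamp
# if cooldown_elapsed "$STAMP" 15; then
#   touch "$STAMP"   # record this attempt, then run the retrigger command
# else
#   echo "cooldown active, not retriggering"
# fi
```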

### Prevention
- The factory runs 3 agents staggered by 3 minutes. During heavy development, many PRs trigger CI simultaneously.
- One pipeline at a time is ideal on this VPS (resource + rate limit reasons).
- If >3 pipelines are pending/running, do NOT create more work.

## OAuth Tokens
- OAuth tokens expire ~2h. If Codeberg is down during refresh, re-login required.
- API token is in `~/.netrc` — read via `awk` in env.sh.
- Review bot has a separate token ($REVIEW_BOT_TOKEN) for formal reviews.
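
The `~/.netrc` lookup mentioned above might look roughly like the following. The actual `env.sh` is not part of this commit, so the function name `netrc_password`, the variable name in the usage note, and the assumption that each entry sits on a single line (`machine <host> login <user> password <token>`) are all illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the password field for MACHINE from a
# netrc-style file. Assumes one entry per line:
#   machine <host> login <user> password <token>
netrc_password() {
  local machine="$1" file="${2:-$HOME/.netrc}"
  awk -v m="$machine" '
    $1 == "machine" && $2 == m {
      for (i = 3; i < NF; i++)
        if ($i == "password") { print $(i + 1); exit }
    }
  ' "$file"
}
```

A caller could then do something like `CODEBERG_TOKEN=$(netrc_password codeberg.org)`, where `CODEBERG_TOKEN` is a placeholder name. An empty result means no matching entry was found.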

## Lessons Learned
- Retrigger storm on 2026-03-12: supervisor + dev-agent both retriggered during rate limit, caused 5+ failed pipelines. Added cooldown awareness.
- Empty commit retrigger works but adds noise to git history. Acceptable tradeoff.

@@ -192,6 +192,30 @@ if [ "$GIT_BRANCH" != "master" ] && [ "$GIT_BRANCH" != "unknown" ]; then
  p2 "Git: on '${GIT_BRANCH}' instead of master"
fi

# =============================================================================
# P2b: FACTORY STALLED — backlog exists but no agent running
# =============================================================================
status "P2: checking factory stall"

BACKLOG_COUNT=$(codeberg_api GET "/issues?state=open&labels=backlog&type=issues&limit=1" 2>/dev/null | jq -r 'length' 2>/dev/null || echo "0")
IN_PROGRESS=$(codeberg_api GET "/issues?state=open&labels=in-progress&type=issues&limit=1" 2>/dev/null | jq -r 'length' 2>/dev/null || echo "0")

if [ "${BACKLOG_COUNT:-0}" -gt 0 ] && [ "${IN_PROGRESS:-0}" -eq 0 ]; then
  # Backlog exists but nothing in progress — check if dev-agent ran recently
  DEV_LOG="${FACTORY_ROOT}/dev/dev-agent.log"
  if [ -f "$DEV_LOG" ]; then
    LAST_LOG_EPOCH=$(stat -c %Y "$DEV_LOG" 2>/dev/null || echo 0)
  else
    LAST_LOG_EPOCH=0
  fi
  NOW_EPOCH=$(date +%s)
  IDLE_MIN=$(( (NOW_EPOCH - LAST_LOG_EPOCH) / 60 ))

  if [ "$IDLE_MIN" -gt 20 ]; then
    p2 "Factory stalled: ${BACKLOG_COUNT} backlog issue(s), no agent ran for ${IDLE_MIN}min"
  fi
fi
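
The stall rule in this hunk (backlog present, nothing in progress, log untouched for more than 20 minutes) can be isolated into two small functions for testing outside the supervisor. This is a sketch, not the committed code: `idle_minutes` and `factory_stalled` are hypothetical names, and `stat -c %Y` assumes GNU coreutils as in the hunk itself. One detail worth noting: assuming `codeberg_api` returns the raw JSON array, both calls pass `limit=1`, so `BACKLOG_COUNT` can only be 0 or 1 and the alert will never report more than one backlog issue.

```shell
#!/usr/bin/env bash
# Hypothetical helper: minutes since FILE was last modified.
# A missing file counts as last modified at epoch 0, matching the hunk above.
idle_minutes() {
  local file="$1" last=0 now
  [ -f "$file" ] && last=$(stat -c %Y "$file" 2>/dev/null || echo 0)
  now=$(date +%s)
  echo $(( (now - last) / 60 ))
}

# Hypothetical predicate: stalled when there is backlog, nothing is in
# progress, and the log has been idle for more than THRESHOLD minutes.
factory_stalled() {
  local backlog="$1" in_progress="$2" log="$3" threshold="${4:-20}"
  [ "$backlog" -gt 0 ] && [ "$in_progress" -eq 0 ] &&
    [ "$(idle_minutes "$log")" -gt "$threshold" ]
}
```

With these in place, the check reduces to `factory_stalled "$BACKLOG_COUNT" "$IN_PROGRESS" "$DEV_LOG" && p2 "..."`, and the threshold becomes a parameter rather than a literal.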

# =============================================================================
# P3: FACTORY DEGRADED — derailed PRs, unreviewed PRs
# =============================================================================