diff --git a/factory/PROMPT.md b/factory/PROMPT.md
index 77ff16f..4dad996 100644
--- a/factory/PROMPT.md
+++ b/factory/PROMPT.md
@@ -19,6 +19,7 @@ Before acting, read the relevant best-practices file:
 - Memory issues → `cat ${FACTORY_ROOT}/factory/best-practices/memory.md`
 - Disk issues → `cat ${FACTORY_ROOT}/factory/best-practices/disk.md`
 - CI issues → `cat ${FACTORY_ROOT}/factory/best-practices/ci.md`
+- Codeberg / rate limits → `cat ${FACTORY_ROOT}/factory/best-practices/codeberg.md`
 - Dev-agent issues → `cat ${FACTORY_ROOT}/factory/best-practices/dev-agent.md`
 - Review-agent issues → `cat ${FACTORY_ROOT}/factory/best-practices/review-agent.md`
 - Git issues → `cat ${FACTORY_ROOT}/factory/best-practices/git.md`
diff --git a/factory/best-practices/codeberg.md b/factory/best-practices/codeberg.md
new file mode 100644
index 0000000..714c409
--- /dev/null
+++ b/factory/best-practices/codeberg.md
@@ -0,0 +1,36 @@
+# Codeberg Best Practices
+
+## Rate Limiting
+Codeberg rate-limits SSH and HTTPS clones. Symptoms:
+- Woodpecker `git` step fails with exit code 128
+- Multiple pipelines fail in quick succession with the same error
+- Retriggers make it WORSE by adding more clone attempts
+
+### What To Do
+- **Do NOT retrigger** during a rate-limit storm. Wait 10-15 minutes.
+- Check if multiple pipelines failed on `git` step recently:
+  ```bash
+  wpdb -c "SELECT number, status, to_timestamp(started) FROM pipelines WHERE repo_id=2 AND status='failure' ORDER BY number DESC LIMIT 5;"
+  wpdb -c "SELECT s.name, s.exit_code FROM steps s JOIN pipelines p ON s.pipeline_id=p.id WHERE p.number= AND p.repo_id=2 AND s.state='failure';"
+  ```
+- If multiple `git` failures with exit 128 in the last 15 min → it's rate limiting. Wait.
+- Only retrigger after 15+ minutes of no CI activity.
+
+### How To Retrigger Safely
+```bash
+cd && git commit --allow-empty -m "ci: retrigger" --no-verify && git push origin --force
+```
+
+### Prevention
+- The factory runs 3 agents staggered by 3 minutes. During heavy development, many PRs trigger CI simultaneously.
+- One pipeline at a time is ideal on this VPS (resource + rate limit reasons).
+- If >3 pipelines are pending/running, do NOT create more work.
+
+## OAuth Tokens
+- OAuth tokens expire ~2h. If Codeberg is down during refresh, re-login required.
+- API token is in `~/.netrc` — read via `awk` in env.sh.
+- Review bot has a separate token ($REVIEW_BOT_TOKEN) for formal reviews.
+
+## Lessons Learned
+- Retrigger storm on 2026-03-12: supervisor + dev-agent both retriggered during rate limit, caused 5+ failed pipelines. Added cooldown awareness.
+- Empty commit retrigger works but adds noise to git history. Acceptable tradeoff.
diff --git a/factory/factory-poll.sh b/factory/factory-poll.sh
index 69c0fff..0442c17 100755
--- a/factory/factory-poll.sh
+++ b/factory/factory-poll.sh
@@ -192,6 +192,30 @@ if [ "$GIT_BRANCH" != "master" ] && [ "$GIT_BRANCH" != "unknown" ]; then
   p2 "Git: on '${GIT_BRANCH}' instead of master"
 fi
 
+# =============================================================================
+# P2b: FACTORY STALLED — backlog exists but no agent running
+# =============================================================================
+status "P2: checking factory stall"
+
+BACKLOG_COUNT=$(codeberg_api GET "/issues?state=open&labels=backlog&type=issues&limit=1" 2>/dev/null | jq -r 'length' 2>/dev/null || echo "0")
+IN_PROGRESS=$(codeberg_api GET "/issues?state=open&labels=in-progress&type=issues&limit=1" 2>/dev/null | jq -r 'length' 2>/dev/null || echo "0")
+
+if [ "${BACKLOG_COUNT:-0}" -gt 0 ] && [ "${IN_PROGRESS:-0}" -eq 0 ]; then
+  # Backlog exists but nothing in progress — check if dev-agent ran recently
+  DEV_LOG="${FACTORY_ROOT}/dev/dev-agent.log"
+  if [ -f "$DEV_LOG" ]; then
+    LAST_LOG_EPOCH=$(stat -c %Y "$DEV_LOG" 2>/dev/null || echo 0)
+  else
+    LAST_LOG_EPOCH=0
+  fi
+  NOW_EPOCH=$(date +%s)
+  IDLE_MIN=$(( (NOW_EPOCH - LAST_LOG_EPOCH) / 60 ))
+
+  if [ "$IDLE_MIN" -gt 20 ]; then
+    p2 "Factory stalled: ${BACKLOG_COUNT} backlog issue(s), no agent ran for ${IDLE_MIN}min"
+  fi
+fi
+
 # =============================================================================
 # P3: FACTORY DEGRADED — derailed PRs, unreviewed PRs
 # =============================================================================
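
Review note: the stall check in the `factory-poll.sh` hunk combines three conditions (backlog present, nothing in progress, dev-agent log idle for more than 20 minutes). A minimal sketch of that decision as pure functions, which may make it easier to exercise without calling the API or touching the log file. The function names `idle_minutes` and `is_stalled` are hypothetical and do not appear in the patch; the 20-minute default is taken from the hunk.

```bash
#!/usr/bin/env bash

# Hypothetical helper (not part of the patch): whole minutes elapsed between
# a last-modified epoch and a "now" epoch, using the same integer division
# as IDLE_MIN in the hunk.
idle_minutes() {
  local last_epoch="$1" now_epoch="$2"
  echo $(( (now_epoch - last_epoch) / 60 ))
}

# Hypothetical helper (not part of the patch): returns 0 when the stall
# condition holds, i.e. backlog > 0, in-progress == 0, idle > threshold.
# The threshold defaults to 20 minutes, matching the hunk.
is_stalled() {
  local backlog="$1" in_progress="$2" idle_min="$3" threshold="${4:-20}"
  [ "$backlog" -gt 0 ] && [ "$in_progress" -eq 0 ] && [ "$idle_min" -gt "$threshold" ]
}
```

Usage would look like `is_stalled "$BACKLOG_COUNT" "$IN_PROGRESS" "$(idle_minutes "$LAST_LOG_EPOCH" "$NOW_EPOCH")" && p2 "..."`. Note that a missing log yields `LAST_LOG_EPOCH=0` in the hunk, so the idle time is then measured from the Unix epoch and always exceeds the threshold.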