No commits in common. "architect/supervisor-pr-sweeper" and "main" have entirely different histories.

@@ -1,38 +0,0 @@
# Sprint: supervisor-pr-sweeper
## Vision issues
- #894 — incident: PR #872 sat abandoned 2h22m after legit CI fail — supervisor needs an unblock-PR sweeper
## What this enables
After this sprint, the supervisor automatically detects abandoned PRs (open PR + blocked issue + failed CI + no recent activity), triages the failure cause, and either re-queues the issue with diagnosis or escalates repeat offenders. No more PRs sitting stuck for hours waiting for human intervention.
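The four detection conditions above can be read as a single predicate. The following is a minimal sketch only: `is_abandoned_pr` and `STALE_AFTER_SECS` are hypothetical names, the one-hour inactivity threshold is an assumption (this document does not specify one), and the literal values `open` and `failure` are placeholders for whatever the forge and CI actually report.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: is_abandoned_pr is not a function from the repository.
# STALE_AFTER_SECS is an assumed inactivity threshold, in seconds.
STALE_AFTER_SECS="${STALE_AFTER_SECS:-3600}"

# Usage: is_abandoned_pr PR_STATE ISSUE_LABEL CI_STATUS LAST_ACTIVITY_EPOCH NOW_EPOCH
# Returns 0 only when all four conditions hold.
is_abandoned_pr() {
  local pr_state="$1" issue_label="$2" ci_status="$3" last_activity="$4" now="$5"

  # Condition 1: the PR is still open.
  [ "$pr_state" = "open" ] || return 1

  # Condition 2: the linked issue carries a blocked label (e.g. blocked:ci_exhausted).
  case "$issue_label" in
    blocked*) ;;
    *) return 1 ;;
  esac

  # Condition 3: the latest CI run failed.
  [ "$ci_status" = "failure" ] || return 1

  # Condition 4: nothing has happened on the PR for at least the threshold.
  [ $(( now - last_activity )) -ge "$STALE_AFTER_SECS" ] || return 1

  return 0
}
```

With these assumed defaults, a PR idle for 2h22m (8520 seconds) with failed CI and a `blocked:ci_exhausted` issue would qualify, while the same PR with activity in the last few minutes would not.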
## What exists today
The majority of the infrastructure is already implemented:
- **CI log reading**: `lib/ci-helpers.sh` wraps `ci-log-reader.py` to read Woodpecker logs from SQLite. Last 200 lines, filterable by step.
- **Failure classification**: `classify_pipeline_failure()` in `ci-helpers.sh` distinguishes infra failures (OOM, timeout, connection) from code failures.
- **Issue lifecycle**: `lib/issue-lifecycle.sh` provides `issue_block()` / unblock helpers, label caching, diagnostic comment posting.
- **PR lifecycle**: `lib/pr-lifecycle.sh` tracks `ci_exhausted` exit reason and maintains per-PR fix counter JSON.
- **WP agent recovery**: `supervisor-run.sh` (#933) already restarts unhealthy Woodpecker agent containers and scans for `blocked:ci_exhausted` issues updated in the last 30 minutes.
- **Preflight data**: `supervisor/preflight.sh` collects open PRs, blocked issues, stuck pipelines, and WP agent health.
- **Stale PR vault items**: The supervisor formula can file vault items for stale PRs (>3 consecutive runs).
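To make the infra-versus-code split concrete, here is an illustrative sketch of that kind of classification. It is not the repository's `classify_pipeline_failure()`, whose signature and patterns are not shown in this document; `classify_log_sketch` and the regular expression are assumptions chosen to match the three infra categories named above (OOM, timeout, connection).

```bash
#!/usr/bin/env bash
# Hypothetical sketch: classify_log_sketch is NOT the repository's
# classify_pipeline_failure(); the patterns below are illustrative guesses.

# Reads a CI log on stdin and prints "infra" or "code".
classify_log_sketch() {
  local log_tail
  # Only inspect the last 200 lines, mirroring the log-reader window.
  log_tail="$(tail -n 200)"

  if grep -Eiq 'out of memory|oom-kill|oomkilled|timed out|connection refused|connection reset' <<<"$log_tail"; then
    echo "infra"
  else
    # Anything that does not look like an infrastructure problem is
    # treated as a code failure in this sketch.
    echo "code"
  fi
}
```

For example, `printf 'dial tcp: connection refused\n' | classify_log_sketch` prints `infra`, while a log containing only a failed assertion prints `code`.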
## Complexity
- ~4 files touched: `supervisor/preflight.sh`, `supervisor/supervisor-run.sh`, `lib/ci-helpers.sh`, `formulas/run-supervisor.toml`
- Estimated 4 sub-issues
- ~85% glue code (wiring existing CI-log + issue-lifecycle helpers), ~15% greenfield (triage prompt, verdict routing, escalation logic)
## Risks
- **False positive triage**: The Claude triage prompt (code vs test-harness vs ci-config vs flake) may misclassify a failure, causing the wrong recovery action. Mitigation: an idempotency guard prevents re-processing, and a human can override.
- **Noisy re-queuing**: If the sweep re-queues genuinely hard issues, agents burn budget again. Mitigation: repeat-offender escalation stops re-queuing after 2 independent failures.
- **Triage prompt cost**: Each swept PR requires a small Claude call. At ~1 sweep per 20-minute cycle, this is negligible compared with agent session costs.
- **Race with dev-poll**: The supervisor may unblock an issue while dev-poll is mid-claim. Mitigation: `issue_claim()` uses assignee + label as atomic guard.
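The repeat-offender mitigation can be sketched as a small routing function. This is a sketch under stated assumptions, not the planned implementation: `route_verdict` and `MAX_FAILURES` are hypothetical names, the failure count is assumed to include the current failure, and sending unrecognized verdicts to escalation is a design guess rather than something this document specifies.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: route_verdict and MAX_FAILURES are illustrative names.
# MAX_FAILURES reflects the "2 independent failures" limit from the risk list.
MAX_FAILURES="${MAX_FAILURES:-2}"

# Usage: route_verdict VERDICT FAILURE_COUNT
# Prints "requeue" or "escalate".
route_verdict() {
  local verdict="$1" failures="$2"

  # Repeat offenders are escalated regardless of the triage verdict,
  # so agents stop burning budget on the same issue.
  if [ "$failures" -ge "$MAX_FAILURES" ]; then
    echo "escalate"
    return 0
  fi

  case "$verdict" in
    # The four verdict classes named in the triage prompt.
    code|test-harness|ci-config|flake) echo "requeue" ;;
    # Assumption: anything the triage could not classify goes to a human.
    *) echo "escalate" ;;
  esac
}
```

Under these assumptions, `route_verdict flake 1` prints `requeue` and `route_verdict code 2` prints `escalate`. Keeping the count check ahead of the verdict check means a misclassified verdict cannot cause more than the allowed number of re-queues.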
## Cost — new infra to maintain
- **No new services or containers** — runs inside existing supervisor polling loop
- **No new scheduled tasks** — uses existing 20-minute supervisor cron
- **One new triage helper function** in `lib/ci-helpers.sh`
- **Ongoing**: monitoring triage accuracy — false positives surface as re-blocked issues
## Recommendation
Worth it. Three validating cases (PRs #872, #896, #908) show this is a recurring pattern. The infrastructure for reading CI logs, classifying failures, and managing issue labels is already in place. The sprint is primarily wiring work. Ship this to close the single largest gap in factory self-healing.