fix: incident: WP gRPC flake burned dev-qwen CI retry budget on #842 (2026-04-16) (#867)

Agent 2026-04-17 01:22:59 +00:00
parent c3e58e88ed
commit 04ead1fbdc
4 changed files with 287 additions and 3 deletions


@ -29,7 +29,7 @@ and injected into your prompt above. Review them now.
1. Read the injected metrics data carefully (System Resources, Docker,
Active Sessions, Phase Files, Stale Phase Cleanup, Lock Files, Agent Logs,
CI Pipelines, Open PRs, Issue Status, Stale Worktrees).
CI Pipelines, Open PRs, Issue Status, Stale Worktrees, **Woodpecker Agent Health**).
Note: preflight.sh auto-removes PHASE:escalate files for closed issues
(24h grace period). Check the "Stale Phase Cleanup" section for any
files cleaned or in grace period this run.
@ -75,6 +75,10 @@ Categorize every finding from the metrics into priority levels.
- Dev/action sessions in PHASE:escalate for > 24h (session timeout)
(Note: PHASE:escalate files for closed issues are auto-cleaned by preflight;
this check covers sessions where the issue is still open)
- **Woodpecker agent unhealthy** (see the "Woodpecker Agent Health" section in preflight):
  - Container not running or in unhealthy state
  - gRPC errors >= 3 in last 20 minutes
  - Fast-failure pipelines (duration < 60s) >= 3 in last 15 minutes
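The three conditions above can be read as a single decision: the first one that trips determines the reason. The sketch below is illustrative only. The function name, the reason strings, and the idea of passing pre-counted values as arguments are assumptions, not the actual preflight.sh implementation; only the thresholds come from the list above.

```shell
#!/usr/bin/env bash
# Hypothetical helper; thresholds mirror the P2 criteria, names are illustrative.
GRPC_ERROR_THRESHOLD=3      # gRPC errors in the last 20 minutes
FAST_FAILURE_THRESHOLD=3    # pipelines with duration < 60s in the last 15 minutes

# wp_agent_health_reason STATE GRPC_ERRORS FAST_FAILURES
# STATE is assumed to be "running" for a healthy container; any other value
# (for example "exited" or "unhealthy") counts as a container problem.
# Prints the first failing check, or "healthy" when none fail.
wp_agent_health_reason() {
  local state="$1" grpc_errors="$2" fast_failures="$3"
  if [ "$state" != "running" ]; then
    echo "container_not_ok"
  elif [ "$grpc_errors" -ge "$GRPC_ERROR_THRESHOLD" ]; then
    echo "grpc_errors"
  elif [ "$fast_failures" -ge "$FAST_FAILURE_THRESHOLD" ]; then
    echo "fast_failures"
  else
    echo "healthy"
  fi
}
```

How the counts are gathered (container logs, the CI API, or something else) is left out here because the source does not specify it.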
### P3 — Factory degraded
- PRs stale: CI finished >20min ago AND no git push to the PR branch since CI completed
@ -100,6 +104,17 @@ For each finding from the health assessment, decide and execute an action.
### Auto-fixable (execute these directly)
**P2 Woodpecker agent unhealthy:**
The supervisor-run.sh script automatically handles WP agent recovery:
- Detects unhealthy state via preflight.sh health checks
- Restarts container via `docker restart`
- Scans for `blocked: ci_exhausted` issues updated in last 30 minutes
- Unassigns and removes blocked label from affected issues
- Posts recovery comment with infra-flake context
- Avoids duplicate restarts via 5-minute cooldown in history file
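The cooldown in the last step can be sketched as follows. This is a minimal illustration under assumptions: the history file format (one epoch timestamp per line, newest last) and both function names are hypothetical, and supervisor-run.sh may store its history differently.

```shell
#!/usr/bin/env bash
# Hypothetical cooldown guard; only the 5-minute window comes from the list above.
COOLDOWN_SECONDS=300

# restart_allowed HISTORY_FILE NOW_EPOCH
# Succeeds when no restart is recorded yet, or when the most recent one
# is at least COOLDOWN_SECONDS old.
restart_allowed() {
  local history_file="$1" now="$2" last
  [ -s "$history_file" ] || return 0
  last="$(tail -n 1 "$history_file")"
  [ $(( now - last )) -ge "$COOLDOWN_SECONDS" ]
}

# record_restart HISTORY_FILE NOW_EPOCH
# Appends the restart time so later runs can honour the cooldown.
record_restart() {
  echo "$2" >> "$1"
}
```

A caller would check `restart_allowed "$file" "$(date +%s)"` before running `docker restart`, then call `record_restart` only after the restart succeeds.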
**P0 Memory crisis:**
# Kill stale one-shot claude processes (>3h old)
pgrep -f "claude -p" --older 10800 2>/dev/null | xargs kill 2>/dev/null || true
@ -248,6 +263,11 @@ Format:
- <what was fixed>
(or "No actions needed")
### WP Agent Recovery (if applicable)
- WP agent restart: <time of restart or "none">
- Issues recovered: <count>
- Reason: <health check reason or "healthy">
### Vault items filed
- vault/pending/<id>.md <reason>
(or "None")