fix: incident: WP gRPC flake burned dev-qwen CI retry budget on #842 (2026-04-16) (#867)

2026-04-17 01:22:59 +00:00 · 2026-04-17 01:22:59 +00:00 · 04ead1fbdc
commit 04ead1fbdc
parent c3e58e88ed
4 changed files with 287 additions and 3 deletions
--- a/formulas/run-supervisor.toml
+++ b/formulas/run-supervisor.toml
@ -29,7 +29,7 @@ and injected into your prompt above. Review them now.

 1. Read the injected metrics data carefully (System Resources, Docker,
   Active Sessions, Phase Files, Stale Phase Cleanup, Lock Files, Agent Logs,
-   CI Pipelines, Open PRs, Issue Status, Stale Worktrees).
+   CI Pipelines, Open PRs, Issue Status, Stale Worktrees, **Woodpecker Agent Health**).
   Note: preflight.sh auto-removes PHASE:escalate files for closed issues
   (24h grace period). Check the "Stale Phase Cleanup" section for any
   files cleaned or in grace period this run.
@ -75,6 +75,10 @@ Categorize every finding from the metrics into priority levels.
 - Dev/action sessions in PHASE:escalate for > 24h (session timeout)
  (Note: PHASE:escalate files for closed issues are auto-cleaned by preflight;
  this check covers sessions where the issue is still open)
+- **Woodpecker agent unhealthy** — see "Woodpecker Agent Health" section in preflight:
+  - Container not running or in unhealthy state
+  - gRPC errors >= 3 in last 20 minutes
+  - Fast-failure pipelines (duration < 60s) >= 3 in last 15 minutes

 ### P3 — Factory degraded
 - PRs stale: CI finished >20min ago AND no git push to the PR branch since CI completed
@ -100,6 +104,17 @@ For each finding from the health assessment, decide and execute an action.

 ### Auto-fixable (execute these directly)

+**P2 Woodpecker agent unhealthy:**
+The supervisor-run.sh script automatically handles WP agent recovery:
+- Detects unhealthy state via preflight.sh health checks
+- Restarts container via `docker restart`
+- Scans for `blocked: ci_exhausted` issues updated in last 30 minutes
+- Unassigns and removes blocked label from affected issues
+- Posts recovery comment with infra-flake context
+- Avoids duplicate restarts via 5-minute cooldown in history file
+
+**P0 Memory crisis:**
+
 **P0 Memory crisis:**
  # Kill stale one-shot claude processes (>3h old)
  pgrep -f "claude -p" --older 10800 2>/dev/null | xargs kill 2>/dev/null || true
@ -248,6 +263,11 @@ Format:
  - <what was fixed>
  (or "No actions needed")

+  ### WP Agent Recovery (if applicable)
+  - WP agent restart: <time of restart or "none">
+  - Issues recovered: <count>
+  - Reason: <health check reason or "healthy">
+
  ### Vault items filed
  - vault/pending/<id>.md — <reason>
  (or "None")