From 04ead1fbdce8284af0642545b87435ace796677f Mon Sep 17 00:00:00 2001 From: Agent Date: Fri, 17 Apr 2026 01:22:59 +0000 Subject: [PATCH 1/2] fix: incident: WP gRPC flake burned dev-qwen CI retry budget on #842 (2026-04-16) (#867) --- formulas/run-supervisor.toml | 22 ++++- supervisor/AGENTS.md | 7 +- supervisor/preflight.sh | 105 +++++++++++++++++++++++ supervisor/supervisor-run.sh | 156 +++++++++++++++++++++++++++++++++++ 4 files changed, 287 insertions(+), 3 deletions(-) diff --git a/formulas/run-supervisor.toml b/formulas/run-supervisor.toml index f31e6bc..e623187 100644 --- a/formulas/run-supervisor.toml +++ b/formulas/run-supervisor.toml @@ -29,7 +29,7 @@ and injected into your prompt above. Review them now. 1. Read the injected metrics data carefully (System Resources, Docker, Active Sessions, Phase Files, Stale Phase Cleanup, Lock Files, Agent Logs, - CI Pipelines, Open PRs, Issue Status, Stale Worktrees). + CI Pipelines, Open PRs, Issue Status, Stale Worktrees, **Woodpecker Agent Health**). Note: preflight.sh auto-removes PHASE:escalate files for closed issues (24h grace period). Check the "Stale Phase Cleanup" section for any files cleaned or in grace period this run. @@ -75,6 +75,10 @@ Categorize every finding from the metrics into priority levels. - Dev/action sessions in PHASE:escalate for > 24h (session timeout) (Note: PHASE:escalate files for closed issues are auto-cleaned by preflight; this check covers sessions where the issue is still open) +- **Woodpecker agent unhealthy** — see "Woodpecker Agent Health" section in preflight: + - Container not running or in unhealthy state + - gRPC errors >= 3 in last 20 minutes + - Fast-failure pipelines (duration < 60s) >= 3 in last 15 minutes ### P3 — Factory degraded - PRs stale: CI finished >20min ago AND no git push to the PR branch since CI completed @@ -100,6 +104,17 @@ For each finding from the health assessment, decide and execute an action. ### Auto-fixable (execute these directly) +**P2 Woodpecker agent unhealthy:** +The supervisor-run.sh script automatically handles WP agent recovery: +- Detects unhealthy state via preflight.sh health checks +- Restarts container via `docker restart` +- Scans for `blocked: ci_exhausted` issues updated in last 30 minutes +- Unassigns and removes blocked label from affected issues +- Posts recovery comment with infra-flake context +- Avoids duplicate restarts via 5-minute cooldown in history file + +**P0 Memory crisis:** + **P0 Memory crisis:** # Kill stale one-shot claude processes (>3h old) pgrep -f "claude -p" --older 10800 2>/dev/null | xargs kill 2>/dev/null || true @@ -248,6 +263,11 @@ Format: - (or "No actions needed") + ### WP Agent Recovery (if applicable) + - WP agent restart: