incident: WP gRPC flake burned dev-qwen CI retry budget on #842 (2026-04-16) #867

Open
opened 2026-04-16 12:06:09 +00:00 by dev-bot · 0 comments

Incident

2026-04-16 ~10:55–11:52 UTC. Woodpecker CI agent (disinto-woodpecker-agent) entered a repeated gRPC-error crashloop (Codeberg #813 class — gRPC-in-nested-docker). Every workflow it accepted exited 1 within seconds, never actually running pipeline steps.

Blast radius: dev-qwen took issue #842 at 10:55, opened PR #859, and burned its full 3-attempt pr-lifecycle CI-fix budget between 10:55 and 11:08 reacting to these infra-flake "CI failures." Each failure arrived in ~30–60 seconds, too fast to be a real test run. After exhausting the budget, dev-qwen marked #842 as blocked: ci_exhausted and moved on. No real bug was being detected; the real failure surfaced later only after an operator restarted the WP agent and manually retriggered pipeline #966 — which then returned a legitimate bats-init-nomad failure in test #6 (different issue).

Root cause of the infra-flake: gRPC-in-nested-docker bug, Woodpecker server ↔ agent comms inside nested containers. Known-flaky; restart of disinto-woodpecker-agent clears it.

Recovery: operator docker restart disinto-woodpecker-agent + retrigger pipelines via WP API POST /api/repos/2/pipelines/<N>. Fresh run reached real stage signal.

Why this burned dev-qwen's budget

pr-lifecycle's CI-fix budget treats every failed commit-status as a signal to invoke the agent. It has no notion of "infra flake" vs. "real test failure" and no heuristic to distinguish them. Four infra-flake failures in 13 minutes looked identical to four real code-bug failures.

Suggestions — what the supervisor can check every 20 min

The supervisor already runs every 1200 s. Add these probes:

1. WP agent container health.

docker inspect disinto-woodpecker-agent --format '{{.State.Health.Status}}'

If unhealthy for the second consecutive supervisor tick → restart it automatically + post a comment on any currently-running dev-bot/dev-qwen issues warning "CI agent was restarted; subsequent failures before this marker may be infra-flake."
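A minimal sketch of probe 1. The container name and `docker inspect` format string are from the incident above; the state-file path is a hypothetical way to carry "previous tick" status between supervisor runs:

```shell
#!/bin/sh
# Probe 1 sketch: restart the WP agent after two consecutive unhealthy ticks.
# AGENT is from the incident; STATE_FILE is a hypothetical persistence path.
AGENT="disinto-woodpecker-agent"
STATE_FILE="${STATE_FILE:-/tmp/wp-agent-health.prev}"

# should_restart PREV CURRENT -> exit 0 only when both ticks were unhealthy
should_restart() {
    [ "$1" = "unhealthy" ] && [ "$2" = "unhealthy" ]
}

probe_wp_agent() {
    status=$(docker inspect "$AGENT" --format '{{.State.Health.Status}}' \
        2>/dev/null || echo missing)
    prev=$(cat "$STATE_FILE" 2>/dev/null || echo healthy)
    echo "$status" > "$STATE_FILE"
    if should_restart "$prev" "$status"; then
        docker restart "$AGENT"
        # Posting the warning comment on in-flight issues goes through the
        # forge API (see the probe-4 sketch).
    fi
}
```

Requiring two consecutive ticks avoids restarting on a single transient health-check blip.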

2. Fast-failure heuristic on WP pipelines.
Query WP API GET /api/repos/2/pipelines?page=1. For each pipeline in state failure, compute finished - started. If duration < 60s, flag as probable infra-flake. Three flagged flakes within a 15-min window → trigger agent restart as in (1) and a bulk-retrigger via POST /api/repos/2/pipelines/<N> for each.
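A sketch of the fast-failure check. `WP_HOST` and `WP_TOKEN` are hypothetical env vars, and the `started`/`finished` epoch field names follow the wording above — verify them against the Woodpecker API before relying on this:

```shell
#!/bin/sh
# Probe 2 sketch: flag sub-60s pipeline failures as probable infra-flakes.

# is_probable_flake STATUS STARTED FINISHED -> exit 0 for a fast failure
is_probable_flake() {
    [ "$1" = "failure" ] && [ $(( $3 - $2 )) -lt 60 ]
}

# Pulling the recent pipeline list (jq extracts the three fields per pipeline):
#   curl -fsS -H "Authorization: Bearer $WP_TOKEN" \
#       "$WP_HOST/api/repos/2/pipelines?page=1" |
#   jq -r '.[] | "\(.status) \(.started) \(.finished)"' |
#   while read -r s b e; do
#       is_probable_flake "$s" "$b" "$e" && echo "flake"
#   done
# Three flakes inside a 15-min window -> restart as in (1), then bulk-retrigger
# each flagged pipeline via POST /api/repos/2/pipelines/<N>.
```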

3. grpc error pattern in agent log.
docker logs --since 20m disinto-woodpecker-agent 2>&1 | grep -c 'grpc error' — if ≥3 matches, agent is probably wedged. Trigger restart as in (1).
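The log check can be wrapped so the threshold lives in one place; the ≥3-matches-in-20-min numbers are from the probe above:

```shell
#!/bin/sh
# Probe 3 sketch: decide "wedged" from agent log lines read on stdin.
grpc_wedged() {
    [ "$(grep -c 'grpc error')" -ge 3 ]
}

# Usage (agent name from the incident):
#   docker logs --since 20m disinto-woodpecker-agent 2>&1 | grpc_wedged \
#       && docker restart disinto-woodpecker-agent
```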

4. Issue-level guard.
When the supervisor detects an agent restart, scan for issues updated in the preceding 30 min with label blocked: ci_exhausted and, for each one:

  • unassign + remove blocked label (return to pool)
  • comment on the issue: "CI agent was unhealthy between HH:MM and HH:MM — prior 3/3 retry budget may have been spent on infra flake, not real failures. Re-queueing for a fresh attempt."
  • retrigger the PR's latest WP pipeline

This last step is the key correction: ci_exhausted preceded by a WP-agent unhealthy window is a false positive; return the issue to the pool with context.

Why this matters for the migration

Between now and cutover every WP CI flake that silently exhausts an agent's budget steals hours of clock time. Without an automatic recovery path, the pace of the step-N backlogs falls off a cliff the moment the agent next goes unhealthy — and it will go unhealthy again (Codeberg #813 is not fixed upstream yet).

Fix for this specific incident (already applied manually)

  • Restarted disinto-woodpecker-agent.
  • Closed PR #859 (kept branch fix/issue-842 at 64080232).
  • Unassigned dev-qwen from #842, removed blocked label, appended prior-art section + pipeline #966 test-#6 failure details to issue body so the next claimant starts with full context.

Non-goals

  • Not trying to fix Codeberg #813 itself (upstream gRPC-in-nested-docker issue).
  • Not trying to fix pr-lifecycle's budget logic — the supervisor-side detection is cheaper and more robust than per-issue budget changes.

Labels / meta

  • bug-report + supervisor-focused. Classify severity as blocker for the migration cadence (not for factory day-to-day — it only bites when an unfixable-by-dev issue hits the budget).
dev-bot added the
bug-report
label 2026-04-16 12:06:09 +00:00
Reference: disinto-admin/disinto#867