incident: WP gRPC flake burned dev-qwen CI retry budget on #842 (2026-04-16) #867
Reference: disinto-admin/disinto#867
Incident

2026-04-16 ~10:55–11:52 UTC. The Woodpecker CI agent (`disinto-woodpecker-agent`) entered a repeated gRPC-error crashloop (Codeberg #813 class — gRPC-in-nested-docker). Every workflow it accepted exited 1 within seconds, never actually running pipeline steps.

Blast radius: dev-qwen took issue #842 at 10:55, opened PR #859, and burned its full 3-attempt `pr-lifecycle` CI-fix budget between 10:55 and 11:08 reacting to these infra-flake "CI failures." Each failure arrived in ~30–60 seconds, too fast to be a real test run. After exhausting the budget, dev-qwen marked #842 as `blocked: ci_exhausted` and moved on. No real bug was being detected; the real failure surfaced only after an operator restarted the WP agent and manually retriggered pipeline #966, which then returned a legitimate `bats-init-nomad` failure in test #6 (a different issue).

Root cause of the infra flake: the gRPC-in-nested-docker bug affecting Woodpecker server ↔ agent comms inside nested containers. Known-flaky; restarting `disinto-woodpecker-agent` clears it.

Recovery: operator ran `docker restart disinto-woodpecker-agent`, then retriggered pipelines via the WP API (`POST /api/repos/2/pipelines/<N>`). The fresh run reached a real stage signal.

Why this burned dev-qwen's budget

`pr-lifecycle`'s CI-fix budget treats every failed commit status as a signal to invoke the agent. It has no notion of "infra flake" vs. "real test failure" and no heuristic to distinguish them. Four infra-flake failures in 13 minutes looked identical to four real code-bug failures.

Suggestions — what supervisor can check every 20min
Supervisor already runs every `1200s`. Add these probes:

1. WP agent container health. If `unhealthy` for a second consecutive supervisor tick → restart it automatically + post a comment on any currently-running dev-bot/dev-qwen issues warning "CI agent was restarted; subsequent failures before this marker may be infra-flake."

2. Fast-failure heuristic on WP pipelines. Query the WP API `GET /api/repos/2/pipelines?page=1`. For each pipeline in state `failure`, compute `finished - started`. If the duration is < 60s, flag it as a probable infra-flake. Three flagged flakes within a 15-min window → trigger an agent restart as in (1) and a bulk retrigger via `POST /api/repos/2/pipelines/<N>` for each.

3. grpc-error pattern in the agent log. `docker logs --since 20m disinto-woodpecker-agent 2>&1 | grep -c 'grpc error'` — if ≥3 matches, the agent is probably wedged. Trigger a restart as in (1).

4. Issue-level guard. When supervisor detects an agent restart, scan for issues updated in the preceding 30min with the label `blocked: ci_exhausted` and, for each one, remove the `blocked` label (return to pool).

This last step is the key correction: a `ci_exhausted` preceded by WP-agent unhealthiness is a false positive; return the issue to the pool with context.

Why this matters for the migration
Between now and cutover, every WP CI flake that silently exhausts an agent's budget steals hours of clock time. Without an automatic recovery path, the pace of the step-N backlogs falls off a cliff the moment the agent next goes unhealthy — and it will go unhealthy again (Codeberg #813 is not fixed upstream yet).
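The fast-failure and grpc-log probes suggested above reduce to a small amount of supervisor-side logic. A minimal sketch, assuming Woodpecker's pipeline list endpoint yields `status`, `started`, and `finished` (unix seconds) per pipeline — the field names, helper names, and thresholds here mirror the probes but are illustrative, not a definitive implementation:

```python
from datetime import datetime, timedelta, timezone

FAST_FAIL_S = 60               # probe 2: failures faster than this are suspect
WINDOW = timedelta(minutes=15)  # probe 2: flake-counting window
FLAKE_THRESHOLD = 3            # flakes in WINDOW before we act
GRPC_THRESHOLD = 3             # probe 3: grep -c 'grpc error' matches before we act

def probable_flakes(pipelines, now):
    """Return the failed pipelines that look like infra flakes: state 'failure',
    finished in under FAST_FAIL_S, and finished within WINDOW of `now`."""
    flagged = []
    for p in pipelines:
        if p["status"] != "failure":
            continue
        if p["finished"] - p["started"] >= FAST_FAIL_S:
            continue  # took long enough to be a real test run
        finished = datetime.fromtimestamp(p["finished"], tz=timezone.utc)
        if now - finished <= WINDOW:
            flagged.append(p)
    return flagged

def should_restart_agent(pipelines, grpc_error_count, now):
    """Combine probe 2 (fast-failure burst) and probe 3 (grpc-error log tail)."""
    return (len(probable_flakes(pipelines, now)) >= FLAKE_THRESHOLD
            or grpc_error_count >= GRPC_THRESHOLD)
```

Each tick, the supervisor would feed `should_restart_agent` the parsed `GET /api/repos/2/pipelines?page=1` payload and the `grep -c` count from the agent log, then perform the restart and bulk retrigger when it returns true.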
Fix for this specific incident (already applied manually)
- Restarted `disinto-woodpecker-agent`.
- Retriggered the pipeline for branch `fix/issue-842` (at `64080232`).
- Removed the `blocked` label, and appended a prior-art section + the pipeline #966 test-#6 failure details to the issue body so the next claimant starts with full context.

Non-goals

No changes to `pr-lifecycle`'s budget logic — the supervisor-side detection is cheaper and more robust than per-issue budget changes.

Labels / meta
`bug-report` + supervisor-focused. Classify severity as a blocker for the migration cadence (not for factory day-to-day — it only bites when an unfixable-by-dev issue hits the budget).
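Probe 4 (the issue-level guard) can be sketched the same way. The issue-dict shape (`number`, `labels`, `updated`) is an assumption for illustration, not a specific forge API; the supervisor would strip the `blocked` label from each returned issue number:

```python
from datetime import datetime, timedelta, timezone

LOOKBACK = timedelta(minutes=30)  # probe 4: pre-restart scan window

def false_positive_blocks(issues, restart_time):
    """Issues marked blocked: ci_exhausted shortly before an agent restart are
    presumed infra-flake victims; return their numbers so the supervisor can
    remove the label and return them to the pool."""
    candidates = []
    for issue in issues:
        if "blocked: ci_exhausted" not in issue["labels"]:
            continue
        # Only issues whose last update falls inside the pre-restart window.
        if timedelta(0) <= restart_time - issue["updated"] <= LOOKBACK:
            candidates.append(issue["number"])
    return candidates
```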