vision: CI circuit breaker — global trust signal that pauses agents when measurement is broken #1139
Reference: disinto-admin/disinto#1139
Architectural response to the 2026-04-21 CI chaos-monkey cascade (Codeberg #843). Long-horizon; filed as vision, not a near-term task.
Motivation
The central observation from the 2026-04-21 incident: every agent treats CI as ground truth, but CI is a measurement apparatus with its own failure modes. When the apparatus breaks, consumers have no fallback and no shared signal to pause gracefully. Today each agent has ad-hoc local heuristics (dev-poll's ci_exhausted fast-path #1047, reviewer's combined-status gate, supervisor's probe ideas sketched in #867/#894), and these heuristics don't coordinate.
During the incident, dev-poll kept trying to open new PRs, reviewer kept refusing to merge, and supervisor was dead; no component knew the measurement was broken. Recovery required a human noticing the queue depth and force-merging hotpatches.
Proposal
One globally shared state, written by one component, subscribed to by all:
State transitions
Subscribers
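The per-state behaviours in this section can be read as an action gate that every subscriber consults before acting. The following is a minimal sketch, not a definitive implementation: the action names (`claim_new_issue`, `iterate_existing_pr`, `merge`, `dispatch_workflow`) are hypothetical labels chosen for illustration, and allowing workflow dispatch while `degraded` is an assumption, since the issue only specifies dispatch behaviour for `untrusted`.

```python
from enum import Enum


class Trust(str, Enum):
    TRUSTED = "trusted"
    DEGRADED = "degraded"
    UNTRUSTED = "untrusted"


# Hypothetical action names; each maps to the states in which it is allowed.
POLICY = {
    # New work is held in both degraded and untrusted.
    "claim_new_issue": {Trust.TRUSTED},
    # Existing PRs continue while degraded, pause when untrusted.
    "iterate_existing_pr": {Trust.TRUSTED, Trust.DEGRADED},
    # Merges (including force merges) are refused as soon as CI is degraded.
    "merge": {Trust.TRUSTED},
    # Assumption: dispatch is only refused when untrusted.
    "dispatch_workflow": {Trust.TRUSTED, Trust.DEGRADED},
}


def is_allowed(action: str, state: Trust) -> bool:
    """Return True if `action` may proceed under the given trust state."""
    return state in POLICY[action]
```

Keying the table by action rather than by agent keeps the rule set in one place, so two agents cannot drift into different interpretations of the same state.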
- On untrusted: pause new issue claims, suspend in-flight PR iteration. On degraded: continue existing PRs but hold new ones.
- On degraded: refuse merges (including force_merge). On untrusted: pause entirely.
- On degraded/untrusted: flip into recovery mode (run provider restart, drain stuck queue, retrigger after cooldown). This subsumes the #867 auto-restart and #894 unblock-sweeper proposals into a single flow.
- On untrusted: refuse to dispatch new workflows; queue them for replay after recovery.
Writer
Exactly one component owns the file. Options:
- ci-health-daemon inside disinto-agents container, polling every 30s.
Decision deferred until adapter+classifier (the other vision issue) lands; the classifier produces the verdicts this daemon aggregates.
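Whichever component ends up owning the file, the single-writer rule is easiest to keep honest if writes are atomic and readers fail closed. The sketch below assumes a JSON file with a `state` key and an `updated_at` timestamp; both key names, the `write_trust`/`read_trust` helpers, and the 120-second staleness limit (four missed 30s polls) are illustrative assumptions, not part of the proposal.

```python
import json
import os
import tempfile
import time
from pathlib import Path

VALID_STATES = ("trusted", "degraded", "untrusted")


def write_trust(path: Path, state: str, metrics: dict) -> None:
    """Atomically replace the trust file so readers never see a partial write."""
    if state not in VALID_STATES:
        raise ValueError(f"unknown trust state: {state!r}")
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = {"state": state, "updated_at": int(time.time()), **metrics}
    # Write to a temp file in the same directory, then rename over the target.
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as fh:
            json.dump(payload, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, path)
    except BaseException:
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise


def read_trust(path: Path, max_age_s: int = 120) -> str:
    """Return the current state; missing, corrupt, or stale files read as untrusted."""
    try:
        data = json.loads(path.read_text())
        state = data["state"]
        age = time.time() - data["updated_at"]
    except (OSError, ValueError, KeyError, TypeError):
        return "untrusted"
    if state not in VALID_STATES or age > max_age_s:
        return "untrusted"
    return state
```

Treating a stale file as `untrusted` is a design choice aimed at the incident's failure mode: if the writer itself dies, subscribers pause instead of trusting the last value it happened to write.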
Why this is vision and not backlog
Non-goals
- An untrusted state that requires human confirmation to clear is acceptable for v1; auto-clear is a later refinement.
Incremental steps when work begins
- trusted always (just for subscribers to read).
- untrusted, verify all agents honour it, verify recovery after provider comes back.
Related
- (untrusted → recovery action path)
- (degraded)
Scope-compression note (2026-04-21 vision-pass review):
This vision can ship v1 without #1138 (CI adapter + classifier) in place. The dependency stated in "Why this is vision and not backlog" — "Depends on the CI adapter + signal classifier for the verdicts it aggregates" — holds for the fully-generalized form but is not required for the minimum viable circuit breaker.
v1 stub path (ships alone, Woodpecker-only):
- data/ci_trust.json with the three-state schema.
- infra_flake_rate_15m is computed from failure events with duration <60s (the #867 signature); queue_depth from pending pipelines; oldest_pending_min directly from the API.
- untrusted → trusted transition for v1. Auto-clear is v2.
This buys ~90% of the 2026-04-21 incident-class mitigation without the multi-provider lift in #1138. If #1138 never ships, this vision still delivers value. If #1138 does ship, this issue's writer gets rewritten to consume classifier verdicts, but the subscribers don't notice because the output schema stays the same.
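The v1 metrics described above could be computed roughly as follows. This is a sketch under stated assumptions: the `Pipeline` record is a hypothetical stand-in for whatever the Woodpecker API returns, the flake rate is taken as fast failures divided by all pipelines finished in the last 15 minutes (the note does not define the denominator), and the thresholds in `classify` are placeholder numbers, not values from the issue.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pipeline:
    """Hypothetical pipeline record; times are epoch seconds."""
    status: str  # e.g. "pending", "failure", "success"
    created: float
    started: Optional[float] = None
    finished: Optional[float] = None


def compute_metrics(pipelines: list, now: float) -> dict:
    window_start = now - 15 * 60
    finished = [
        p for p in pipelines
        if p.started is not None and p.finished is not None
        and p.finished >= window_start
    ]
    # Failures that die in under 60s are treated as infra flakes (#867 signature).
    fast_failures = [
        p for p in finished
        if p.status == "failure" and p.finished - p.started < 60
    ]
    pending = [p for p in pipelines if p.status == "pending"]
    return {
        "infra_flake_rate_15m": len(fast_failures) / len(finished) if finished else 0.0,
        "queue_depth": len(pending),
        "oldest_pending_min": max(((now - p.created) / 60 for p in pending), default=0.0),
    }


def classify(m: dict) -> str:
    # Placeholder thresholds, chosen only to make the sketch concrete.
    if m["infra_flake_rate_15m"] >= 0.5 or m["oldest_pending_min"] >= 30:
        return "untrusted"
    if m["infra_flake_rate_15m"] >= 0.2 or m["queue_depth"] >= 10:
        return "degraded"
    return "trusted"


def next_state(current: str, proposed: str, human_confirmed: bool = False) -> str:
    """v1 rule: untrusted is sticky until a human confirms; no auto-clear."""
    if current == "untrusted" and not human_confirmed:
        return "untrusted"
    return proposed
```

Because subscribers only read the resulting state and metric fields, a later writer that consumes classifier verdicts from #1138 could replace `compute_metrics` and `classify` without changing what subscribers see.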
Suggested acceptance addition:
Marking #1138 as not blocking this vision. Updating the "Why this is vision and not backlog" line accordingly is optional; it can be left as is if we want to preserve the full architectural intent.