# vision: CI circuit breaker — global trust signal that pauses agents when measurement is broken #1139

**Open** · opened 2026-04-21 16:38:02 +00:00 by dev-bot (Collaborator) · 1 comment

Architectural response to the 2026-04-21 CI chaos-monkey cascade (Codeberg #843). Long-horizon; filed as vision, not a near-term task.

## Motivation

The central observation from the 2026-04-21 incident: **every agent treats CI as ground truth, but CI is a measurement apparatus with its own failure modes.** When the apparatus breaks, consumers have no fallback and no shared signal to pause gracefully. Today each agent has ad-hoc local heuristics (dev-poll's `ci_exhausted` fast-path #1047, reviewer's combined-status gate, supervisor's probe ideas sketched in #867/#894), and these heuristics don't coordinate.

During the incident, dev-poll kept trying to open new PRs, reviewer kept refusing to merge, supervisor was dead — no component knew the measurement was broken. Recovery required a human noticing the queue depth and force-merging hotpatches.

## Proposal

One globally shared state, written by one component, subscribed to by all:

`data/ci_trust.json`:

```json
{
  "level": "trusted|degraded|untrusted",
  "reason": "<short string>",
  "since": "2026-04-21T11:17Z",
  "evidence": {
    "provider_health": "unhealthy",
    "infra_flake_rate_15m": 0.42,
    "queue_depth": 32,
    "oldest_pending_min": 55
  },
  "next_reeval": "2026-04-21T11:22Z"
}
```
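A subscriber's read path can stay tiny. The sketch below is one possible shape (the function name and the fail-safe fallback are assumptions, not part of this proposal): treat a missing or malformed trust file as `untrusted`, so a broken writer degrades agents toward pausing rather than toward merging blind.

```python
import json
from pathlib import Path

VALID_LEVELS = {"trusted", "degraded", "untrusted"}

def read_ci_trust(path="data/ci_trust.json"):
    """Load the shared trust state. A missing or malformed file is
    reported as 'untrusted' so subscribers fail safe, not open."""
    try:
        state = json.loads(Path(path).read_text())
    except (OSError, json.JSONDecodeError):
        return {"level": "untrusted", "reason": "trust file unreadable"}
    if state.get("level") not in VALID_LEVELS:
        return {"level": "untrusted", "reason": "invalid level field"}
    return state
```

The fail-closed default matters: without it, deleting or corrupting the file would silently re-enable every agent.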

### State transitions

- **trusted → degraded** when any of:
  - infra-flake verdict rate > 20% in a rolling 15 min window
  - queue depth > 3× worker count for > 5 min
  - oldest pending pipeline > 30 min
- **degraded → untrusted** when any of:
  - provider-down verdict from the classifier (requires vision: CI adapter)
  - infra-flake rate > 50% in a rolling 15 min window
  - no pipelines have finished in the last 15 min despite an active queue
- **untrusted → degraded → trusted** only on successful recovery probes (provider health green + N consecutive clean pipelines).
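The threshold rules above collapse into a small pure function. This is a sketch under assumptions: the evidence keys mirror the JSON schema, the `finished_last_15m` flag and `provider_down` parameter are hypothetical names, and the "for > 5 min" duration qualifiers are elided (a real evaluator would need to track how long each condition has held).

```python
def evaluate_level(evidence, worker_count, provider_down=False):
    """Map raw evidence onto the three trust levels, checking the
    untrusted conditions first so the most severe state wins."""
    flake = evidence.get("infra_flake_rate_15m", 0.0)
    queue = evidence.get("queue_depth", 0)
    oldest = evidence.get("oldest_pending_min", 0)
    finished_recently = evidence.get("finished_last_15m", True)

    # degraded -> untrusted triggers
    if provider_down or flake > 0.50 or (queue > 0 and not finished_recently):
        return "untrusted"
    # trusted -> degraded triggers
    if flake > 0.20 or queue > 3 * worker_count or oldest > 30:
        return "degraded"
    return "trusted"
```

Note the ordering: evaluating severity top-down means the function never reports `degraded` while an `untrusted` condition is active.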

### Subscribers

- **dev-poll** on `untrusted`: pause new issue claims, suspend in-flight PR iteration. On `degraded`: continue existing PRs but hold new ones.
- **reviewer-agent** on `degraded`: refuse merges (including `force_merge`). On `untrusted`: pause entirely.
- **supervisor** on `degraded` / `untrusted`: flip into recovery mode — run the provider restart, drain the stuck queue, retrigger after a cooldown (this subsumes the #867 auto-restart and #894 unblock-sweeper proposals into a single flow).
- **dispatcher** on `untrusted`: refuse to dispatch new workflows; queue them for replay after recovery.
- **humans** via a banner in the canvas / dashboard — a clear signal that the factory is in protected mode.
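The per-agent behaviours above are just a static policy table, which suggests one way to keep the subscribe-and-honour logic uniform across agents. A hypothetical sketch (the action strings and the dispatcher's `degraded` behaviour, which the table above leaves unspecified, are assumptions):

```python
# Hypothetical policy table: what each subscriber does at each trust level.
POLICY = {
    "dev-poll":   {"degraded": "hold_new_prs",      "untrusted": "pause_all"},
    "reviewer":   {"degraded": "refuse_merges",     "untrusted": "pause_all"},
    "supervisor": {"degraded": "recovery_mode",     "untrusted": "recovery_mode"},
    # dispatcher behaviour on degraded is not specified above; assumed normal.
    "dispatcher": {"degraded": "proceed",           "untrusted": "queue_for_replay"},
}

def action_for(agent, level):
    """Look up what `agent` should do at trust `level`."""
    if level == "trusted":
        return "proceed"
    return POLICY[agent][level]
```

Keeping the mapping in one table (rather than scattered `if` checks in each agent) makes the chaos-drill assertion in the incremental steps easy to state: every agent's action must equal the table's entry.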

### Writer

Exactly one component owns the file. Options:

- (A) A new `ci-health-daemon` inside the `disinto-agents` container, polling every 30 s.
- (B) An extension of the supervisor's role. Lighter, but the supervisor itself has been unreliable (#1120); the writer must be resilient independently (see the companion vision issue on the heartbeat watchdog).
- (C) A separate tiny script invoked from the polling loop, the same pattern as dev-poll / review-poll.

Decision deferred until adapter+classifier (the other vision issue) lands — the classifier produces the verdicts this daemon aggregates.

## Why this is vision and not backlog

- Depends on the CI adapter + signal classifier (a separate vision issue) for the verdicts it aggregates.
- Requires every long-lived agent to grow a subscribe-and-honour loop — a coordinated change across dev/review/supervisor/dispatcher.
- Needs threshold tuning against historical incident data we don't yet collect systematically.
- Must not itself become a SPOF — the writer needs heartbeat monitoring, the same pattern as the supervisor watchdog discussion.

## Non-goals

- Making the circuit breaker smarter than the classifier. This issue is aggregation + distribution, not diagnosis.
- Replacing branch protection or Forgejo's native CI gating. Branch protection is the hard floor; the circuit breaker is an upstream soft pause that keeps agents from piling up behind a broken measurement layer.
- Perfect auto-recovery. An `untrusted` state that requires human confirmation to clear is acceptable for v1; auto-clear is a later refinement.

## Incremental steps when work begins

1. Define the JSON schema and write a stub writer that always emits `trusted` (just so subscribers have something to read).
2. Add a subscribe-check in one consumer (reviewer-agent is cheapest — one merge-gate call).
3. Expand to dev-poll, then supervisor, then dispatcher.
4. Replace the stub writer with a real threshold evaluator once classifier verdicts exist.
5. Chaos drill: kill the CI provider, verify the state goes `untrusted`, verify all agents honour it, verify recovery after the provider comes back.
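Step 1's stub writer fits in a few lines. A minimal sketch, assuming the schema above; the write-to-temp-then-rename dance is an added suggestion so a subscriber polling mid-write never reads a partial file:

```python
import json
import time
from pathlib import Path

def write_stub_trust(path="data/ci_trust.json"):
    """Step-1 stub: always emits 'trusted' so subscribers can wire up
    their read path before real threshold evaluation exists."""
    now = time.strftime("%Y-%m-%dT%H:%MZ", time.gmtime())
    state = {
        "level": "trusted",
        "reason": "stub writer, thresholds not yet implemented",
        "since": now,
        "evidence": {},
        "next_reeval": now,
    }
    p = Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)
    tmp = p.parent / (p.name + ".tmp")
    tmp.write_text(json.dumps(state, indent=2))
    tmp.replace(p)  # atomic rename: readers never see a half-written file
    return state
```

Swapping this for the real evaluator in step 4 changes only how `state` is computed; the file contract stays fixed.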
## Related

- Codeberg #843 — the incident writeup motivating this
- #867 — supervisor auto-restart on provider unhealth (becomes the `untrusted → recovery` action path)
- #894 — unblock-PR sweeper (becomes a consumer action under `degraded`)
- #1120 — supervisor silent death; informs the "writer cannot itself be a SPOF" constraint
- Vision companion: "CI provider adapter + signal classifier" (produces the verdicts this breaker consumes)
dev-bot added the `vision` label 2026-04-21 16:38:02 +00:00

---

**dev-bot** (Author, Collaborator) commented:

Scope-compression note (2026-04-21 vision-pass review):

This vision can ship v1 without #1138 (CI adapter + classifier) in place. The dependency stated in "Why this is vision and not backlog" — "Depends on the CI adapter + signal classifier for the verdicts it aggregates" — holds for the fully-generalized form but is not required for the minimum viable circuit breaker.

**v1 stub path** (ships alone, Woodpecker-only):

1. The writer reads directly from the Woodpecker REST API + SQLite (the same access pattern the supervisor already has) and emits `data/ci_trust.json` with the three-state schema.
2. Thresholds are evaluated against raw pipeline records, not classifier verdicts: `infra_flake_rate_15m` is computed from `failure` events with a duration < 60 s (the #867 signature); `queue_depth` from pending pipelines; `oldest_pending_min` directly from the API.
3. Subscribe-and-honour is wired into **reviewer-agent only** as the v1 consumer — one merge-gate call, minimal risk, and the reviewer is where merging-while-CI-is-lying did the worst damage on 2026-04-21.
4. A human-cleared `untrusted → trusted` transition for v1. Auto-clear is v2.
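Point 2's classifier-free flake rate can be sketched directly over raw pipeline records. The record field names (`status`, `started_s`, `finished_s`) are assumptions standing in for whatever the Woodpecker API actually returns; the < 60 s failure heuristic is the #867 signature quoted above:

```python
def infra_flake_rate_15m(pipelines, now_s):
    """Fraction of pipelines finished in the last 15 minutes that
    look like infra flakes: status 'failure' with a runtime < 60 s."""
    window = [p for p in pipelines
              if p.get("finished_s") is not None
              and now_s - p["finished_s"] <= 15 * 60]
    if not window:
        return 0.0  # no evidence either way; caller decides what that means
    flakes = [p for p in window
              if p["status"] == "failure"
              and p["finished_s"] - p["started_s"] < 60]
    return len(flakes) / len(window)
```

The empty-window case is worth deciding explicitly: returning 0.0 reads as "trusted", but an empty window with an active queue is itself one of the `untrusted` triggers above, so the two signals must be combined, not read in isolation.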

This buys ~90% of the 2026-04-21 incident-class mitigation without the multi-provider lift in #1138. If #1138 never ships, this vision still delivers value. If #1138 does ship, this issue's writer gets rewritten to consume classifier verdicts — but the subscribers don't notice because the output schema stays the same.

**Suggested acceptance addition:**

- [ ] Chaos drill where Woodpecker is killed mid-queue and reviewer-agent pauses merges within one polling cycle.

Marking #1138 as not-blocking this vision. Updating the "Why this is vision and not backlog" line accordingly is optional — leaving it as is if we want to preserve the full architectural intent.

Reference: disinto-admin/disinto#1139