sprint: add vault-blast-radius-tiers.md

2026-04-08 10:28:17 +00:00 · 2026-04-08 10:28:17 +00:00 · 7afe8bf145
commit 7afe8bf145
parent d033dff89e
1 changed file with 87 additions and 0 deletions
--- a/sprints/vault-blast-radius-tiers.md
+++ b/sprints/vault-blast-radius-tiers.md
@@ -0,0 +1,87 @@
+# Sprint: Vault blast-radius tiers
+
+## Vision issues
+- #419 — Vault: blast-radius based approval tiers
+
+## What this enables
+After this sprint, vault operations are classified by blast radius — low-risk operations
+(docs, feature-branch edits) flow through without human gating; medium-risk operations
+(CI config, Dockerfile changes) queue for async review; high-risk operations (production
+deploys, secrets rotation, agent self-modification) hard-block as today.
+
+The practical effect: the dev loop no longer stalls waiting for human approval of routine
+operations. Agents can move autonomously through an estimated 80%+ of vault requests while preserving
+the safety contract on irreversible operations.
+
+## What exists today
+The vault redesign (#73-#77) is complete and all five issues are closed:
+- lib/vault.sh - idempotent vault PR creation via Forgejo API
+- docker/edge/dispatcher.sh - polls merged vault PRs, verifies admin approval, launches runners
+- vault/vault-env.sh - TOML validation for vault action files
+- vault/SCHEMA.md - vault action TOML schema
+- lib/branch-protection.sh - admin-only merge enforcement on ops repo
+
+Currently every vault request goes through the same hard-block path regardless of risk.
+No classification layer exists. All formulas share the same single approval tier.
+
+Note: prerequisites.md says the vault redesign is incomplete; this is stale. Issues #73-#77
+are all closed.
+
+## Complexity
+Files touched: ~14 (7 new, 7 modified)
+
+New files:
+- vault/classify.sh - classification engine, ~150 lines: path glob matching, secret risk scoring, formula risk lookup
+- vault/policy.toml - human-editable policy rules mapping path patterns to tier assignments
+- vault/formula-risks.toml - formula-to-risk-level mapping (release=high, gardener=low)
+- docs/BLAST-RADIUS.md - documentation
+- 2-3 test helpers
+
+Modified files:
+- vault/SCHEMA.md - optional blast_radius override field
+- vault/vault-env.sh - call classify.sh, validate classification
+- lib/vault.sh - attach computed tier as PR label; skip PR creation for auto-approve tier
+- docker/edge/dispatcher.sh - enforce tier policy: auto-merge low, route medium, hard-block high
+- lib/branch-protection.sh - potentially vary required-approvals by tier
+
+Glue code vs greenfield: ~60% glue code, ~40% greenfield (classification engine and policy format).
+
+Estimated sub-issues: 4-5
+
+## Risks
+1. Classification errors on consequential operations. Classification is deterministic
+   (pattern matching, not AI judgment), but a misconfigured policy.toml could auto-approve
+   something that should hard-block. Mitigation: default-deny all unknown patterns; policy
+   changes require human review.
+
+2. Dispatcher complexity. dispatcher.sh is already 1005 lines. Adding three code paths adds
+   ~150 lines. Mitigation: extract classification to classify.sh so dispatcher delegates,
+   not decides.
+
+3. Branch-protection interaction. Auto-approve tier means the dispatcher merges without human
+   approval. branch-protection.sh currently requires 1 approval; the dispatcher must bypass
+   this for auto-approve tier. Either the vault runner needs an admin token, or branch
+   protection must become tier-aware. This is the primary design fork.
+
+4. Stale prerequisites.md. Should be updated as part of execution.
+
+## Cost - new infra to maintain
+- vault/policy.toml - operators must keep it current as new formulas are added. Unknown formulas
+  default to HIGH (safe, forces manual approval).
+- vault/classify.sh - one shell script, shellcheck-covered, no runtime daemon.
+- No new services, cron jobs, or agent roles.
+
+Ongoing cost is low.
+
+## Recommendation
+Worth it. The vault redesign is done; blast-radius tiers are the logical next step to make
+it usable in practice. Today every vault action requires human approval,
+which is the primary reason agents cannot operate continuously. This sprint has clear scope
+(~14 files, 4-5 sub-issues), low new maintenance cost, and directly unblocks autonomous
+operation in the Foundation phase.
+
+The branch-protection and admin-token interaction (Risk 3) is the only design fork worth
+resolving before implementation. Everything else is straightforward.
+
+---
+Reply ACCEPT to proceed with design questions, or REJECT: <reason> to decline.
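
Reviewer sketch (not part of the commit): the default-deny classification described under Risk 1 might look roughly like the following. This is a minimal illustration under stated assumptions, not the planned vault/classify.sh. The rule patterns, the `POLICY_RULES` array, and the function names `classify_path` and `classify_request` are all hypothetical, and the real policy would presumably be read from vault/policy.toml rather than hard-coded.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of path-glob tier classification with default-deny.
# The patterns below are illustrative only; they are NOT the contents of
# vault/policy.toml.

# Ordered "glob:tier" rules; the first matching rule wins.
POLICY_RULES=(
  'docs/*:low'
  'ci/*:medium'
  'Dockerfile:medium'
  '*/Dockerfile:medium'
)

# Print the tier for a single path. Paths matching no rule get "high".
classify_path() {
  local path="$1" rule pattern tier
  for rule in "${POLICY_RULES[@]}"; do
    pattern="${rule%%:*}"   # text before the first colon
    tier="${rule##*:}"      # text after the last colon
    # The right-hand side is deliberately unquoted so it is matched as a glob.
    # shellcheck disable=SC2053
    if [[ "$path" == $pattern ]]; then
      printf '%s\n' "$tier"
      return 0
    fi
  done
  # Default-deny: unknown paths fall into the most restrictive tier.
  printf 'high\n'
}

# Print the tier for a whole request: the highest tier among its paths.
# A request with no paths is treated as "high" rather than auto-approved.
classify_request() {
  local path tier result="low"
  if (( $# == 0 )); then
    printf 'high\n'
    return 0
  fi
  for path in "$@"; do
    tier="$(classify_path "$path")"
    if [[ "$tier" == "high" ]]; then
      result="high"
    elif [[ "$tier" == "medium" && "$result" == "low" ]]; then
      result="medium"
    fi
  done
  printf '%s\n' "$result"
}
```

Taking the maximum tier across all touched paths means one unrecognized file is enough to push a request into the hard-block path, which is the behavior the default-deny mitigation asks for. Note that inside `[[ ... ]]` a `*` also matches `/`, so `docs/*` covers nested paths; a real policy format would need to state whether that is intended.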