architect: vault blast-radius tiers #9

Merged
disinto-admin merged 2 commits from architect/vault-blast-radius-tiers into main 2026-04-08 19:08:05 +00:00
Showing only changes of commit 7afe8bf145

@ -0,0 +1,87 @@
# Sprint: Vault blast-radius tiers
## Vision issues
- #419 — Vault: blast-radius based approval tiers
## What this enables
After this sprint, vault operations are classified by blast radius — low-risk operations
(docs, feature-branch edits) flow through without human gating; medium-risk operations
(CI config, Dockerfile changes) queue for async review; high-risk operations (production
deploys, secrets rotation, agent self-modification) hard-block as today.
The practical effect: the dev loop no longer stalls waiting for human approval of routine
operations. Agents can move autonomously through an estimated 80%+ of vault requests while
preserving the safety contract on irreversible operations.
## What exists today
The vault redesign (#73-#77) is complete and all five issues are closed:
- lib/vault.sh - idempotent vault PR creation via Forgejo API
- docker/edge/dispatcher.sh - polls merged vault PRs, verifies admin approval, launches runners
- vault/vault-env.sh - TOML validation for vault action files
- vault/SCHEMA.md - vault action TOML schema
- lib/branch-protection.sh - admin-only merge enforcement on ops repo
Currently every vault request goes through the same hard-block path regardless of risk.
No classification layer exists. All formulas share the same single approval tier.
Note: prerequisites.md says the vault redesign is incomplete. That is stale: issues #73-#77
are all closed.
## Complexity
Files touched: ~14 (7 new, 7 modified)
New files:
- vault/classify.sh - classification engine, ~150 lines: path glob matching, secret risk scoring, formula risk lookup
- vault/policy.toml - human-editable policy rules mapping path patterns to tier assignments
- vault/formula-risks.toml - formula-to-risk-level mapping (release=high, gardener=low)
- docs/BLAST-RADIUS.md - documentation
- 2-3 test helpers
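
As a rough illustration of the path-matching part of classify.sh, here is a minimal sketch. The patterns (`secrets/*`, `docs/*`, `ci/*`, `Dockerfile`), the function names `classify_path`, `tier_rank`, and `classify_request`, and the rule that a request takes the highest tier among its paths are all assumptions made for illustration; the real rules would live in vault/policy.toml. Secret risk scoring and formula lookup are left out.

```shell
#!/bin/sh
# Hypothetical sketch: path-based tier classification with default-deny.
# Patterns are illustrative, not the actual contents of vault/policy.toml.

# classify_path PATH -> prints low | medium | high
classify_path() {
    case "$1" in
        *..*)                          echo high ;;    # refuse path traversal
        secrets/*)                     echo high ;;
        docs/*)                        echo low ;;
        ci/*|Dockerfile|*/Dockerfile)  echo medium ;;
        *)                             echo high ;;    # default-deny: unknown paths hard-block
    esac
}

# tier_rank TIER -> numeric rank so tiers can be compared
tier_rank() {
    case "$1" in
        low)    echo 0 ;;
        medium) echo 1 ;;
        *)      echo 2 ;;    # anything unrecognized ranks as high
    esac
}

# classify_request PATH... -> highest tier among all touched paths
classify_request() {
    if [ "$#" -eq 0 ]; then
        echo high            # nothing to classify: fail closed
        return 0
    fi
    worst=low
    for p in "$@"; do
        t=$(classify_path "$p")
        if [ "$(tier_rank "$t")" -gt "$(tier_rank "$worst")" ]; then
            worst=$t
        fi
    done
    echo "$worst"
}
```

Under these assumptions, `classify_request docs/a.md Dockerfile` prints `medium`, and any request touching an unlisted path prints `high`. Taking the maximum keeps a single risky path from being masked by many harmless ones.
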
Modified files:
- vault/SCHEMA.md - optional blast_radius override field
- vault/vault-env.sh - call classify.sh, validate classification
- lib/vault.sh - attach computed tier as PR label; skip PR creation for auto-approve tier
- docker/edge/dispatcher.sh - enforce tier policy: auto-merge low, route medium, hard-block high
- lib/branch-protection.sh - potentially vary required-approvals by tier
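
One way the dispatcher could delegate the decision rather than make it is sketched below. It assumes classify.sh prints a tier name on stdout when given an action file, which this proposal does not specify. `dispatch_action` and `CLASSIFY_CMD` are hypothetical names, and the printed action strings stand in for the real merge, routing, and blocking logic.

```shell
#!/bin/sh
# Hypothetical sketch: the dispatcher asks the classifier for a tier and only
# maps that tier to an action. The classifier interface is an assumption.

# dispatch_action ACTION_FILE -> prints auto-merge | queue-async-review | hard-block
dispatch_action() {
    # CLASSIFY_CMD is a hypothetical override hook, handy for tests.
    # If the classifier fails for any reason, fail closed to the high tier.
    tier=$("${CLASSIFY_CMD:-vault/classify.sh}" "$1") || tier=high

    case "$tier" in
        low)    echo auto-merge ;;
        medium) echo queue-async-review ;;
        *)      echo hard-block ;;    # high, empty, or unrecognized output
    esac
}
```

The point of the shape is that a classifier crash, a missing script, or unexpected output all land on the hard-block path, so the dispatcher never has to guess.
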
Gluecode vs greenfield: ~60% gluecode, ~40% greenfield (classification engine and policy format).
Estimated sub-issues: 4-5
## Risks
1. Classification errors on consequential operations. Classification is deterministic
(pattern matching, not AI judgment), but a misconfigured policy.toml could auto-approve
something that should hard-block. Mitigation: default-deny all unknown patterns; policy
changes require human review.
2. Dispatcher complexity. dispatcher.sh is already 1005 lines. Adding three code paths adds
~150 lines. Mitigation: extract classification to classify.sh so dispatcher delegates,
not decides.
3. Branch-protection interaction. Auto-approve tier means the dispatcher merges without human
approval. branch-protection.sh currently requires 1 approval, so the dispatcher must bypass
it for the auto-approve tier. That requires either an admin token in the vault runner or
tier-aware branch protection. This is the primary design fork.
4. Stale prerequisites.md. Should be updated as part of execution.
## Cost - new infra to maintain
- vault/policy.toml and vault/formula-risks.toml - operators must keep these current as new
paths and formulas are added. Unknown formulas default to HIGH (safe, forces manual approval).
- vault/classify.sh - one shell script, shellcheck-covered, no runtime daemon.
- No new services, cron jobs, or agent roles.
Ongoing cost is low.
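
The "unknown formulas default to HIGH" rule could look roughly like the sketch below. It assumes formula-risks.toml is a flat list of `name = "level"` lines (for example `release = "high"`), which is a guess at a format this proposal does not define; `formula_risk` is a hypothetical helper name. The line-based parsing is deliberately naive and is not a general TOML parser.

```shell
#!/bin/sh
# Hypothetical sketch: look up a formula's risk level, failing closed to "high".
# Assumes a flat file of lines like:  gardener = "low"

# formula_risk NAME FILE -> prints low | medium | high
formula_risk() {
    if [ ! -r "$2" ]; then
        echo high            # missing or unreadable mapping: fail closed
        return 0
    fi
    awk -v name="$1" -F'=' '
        {
            key = $1; val = $2
            gsub(/[ \t"]/, "", key)    # strip spaces, tabs, and quotes
            gsub(/[ \t"]/, "", val)
            if (key == name && (val == "low" || val == "medium" || val == "high")) {
                print val
                found = 1
                exit
            }
        }
        END { if (!found) print "high" }   # unknown formula or invalid level
    ' "$2"
}
```

A formula that is missing from the file, or that carries a level other than low, medium, or high, is treated as high, so forgetting to update the mapping costs a manual approval rather than an unreviewed run.
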
## Recommendation
Worth it. The vault redesign is done; blast-radius tiers are the logical next step to make
it usable in practice. Today every vault action requires human approval, which is the
primary reason agents cannot operate continuously. This sprint has clear scope
(~14 files, 4-5 sub-issues), low new maintenance cost, and directly unblocks autonomous
operation in the Foundation phase.
The branch-protection and admin-token interaction (Risk 3) is the only design fork worth
resolving before implementation. Everything else is straightforward.
---
Reply ACCEPT to proceed with design questions, or REJECT: reason to decline.