disinto-ops/sprints/vault-blast-radius-tiers.md

# Sprint: vault blast-radius tiers
## Vision issues
- #419 — Vault: blast-radius based approval tiers
## What this enables
After this sprint, low-tier vault actions execute without waiting for a human. The dispatcher
auto-approves and merges vault PRs classified as `low` in `policy.toml`. Medium and high tiers
are unchanged: medium notifies and allows async review; high blocks until admin approves.
This removes the bottleneck on low-risk bookkeeping operations while preserving the hard gate
on production deploys, secret operations, and agent self-modification.
## What exists today
The tier infrastructure is fully built. Only the enforcement is missing.
- `vault/policy.toml` — Maps every formula to low/medium/high. Current low tier: groom-backlog,
triage, reproduce, review-pr. Medium: dev, run-planner, run-gardener, run-predictor,
run-supervisor, run-architect, upgrade-dependency. High: run-publish-site, run-rent-a-human,
add-rpc-method, release.
- `vault/classify.sh` — Shell classifier called by `vault-env.sh`. Returns tier for a given formula.
- `vault/SCHEMA.md` — Documents `blast_radius` override field (string: "low"/"medium"/"high")
that vault action TOMLs can use to override policy defaults.
- `vault/validate.sh` — Validates vault action TOML fields including blast_radius.
- `docker/edge/dispatcher.sh` — Edge dispatcher. Polls ops repo for merged vault PRs and executes
them. Currently fires ALL merged vault PRs without tier differentiation.

What's missing: the dispatcher does not read blast_radius, does not auto-approve low-tier PRs,
and does not differentiate notification behavior for medium vs high tier.
## Complexity
Files touched: 3
- `docker/edge/dispatcher.sh` — read blast_radius from vault action TOML; for low tier, call
Forgejo API to approve + merge the PR directly (admin token); for medium, post "pending async
review" comment; for high, leave pending (existing behavior)
- `lib/vault.sh` `vault_request()` — include blast_radius in the PR body so the dispatcher
can read it without re-parsing the TOML
- `docs/VAULT.md` — document the three-tier behavior for operators

Sub-issues: 3

Gluecode ratio: ~70% gluecode (dispatcher reads existing classify.sh output), ~30% new (auto-approve API call, comment logic)
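
The dispatcher change listed above can be sketched roughly as follows. This is a minimal illustration, not the actual `dispatcher.sh` code: `FORGE_URL`, `REPO`, the `blast_radius: <tier>` line in the PR body, and every function name are assumptions made for the example. The review and merge endpoints are the Gitea-compatible ones Forgejo exposes. Whether the approve call succeeds at all depends on how the `admin_enforced` question under Risks is resolved.

```shell
#!/usr/bin/env bash
# Sketch only. FORGE_URL, REPO, and the "blast_radius: <tier>" PR-body line
# are assumptions for illustration; FORGE_TOKEN is the token already in use.

# Read a PR body on stdin and print its tier. Anything missing or
# unrecognised is reported as "high" so that it stays gated.
tier_from_pr_body() {
  local tier
  tier="$(sed -n 's/^blast_radius:[[:space:]]*\([a-z]*\)[[:space:]]*$/\1/p' | head -n 1)"
  case "$tier" in
    low|medium|high) printf '%s\n' "$tier" ;;
    *)               printf 'high\n' ;;
  esac
}

# Map a tier to the dispatcher action described in the file list.
plan_for_tier() {
  case "$1" in
    low)    printf 'approve-and-merge\n' ;;
    medium) printf 'comment-pending-async-review\n' ;;
    *)      printf 'leave-pending\n' ;;   # high and unknown tiers
  esac
}

# POST a JSON payload to the Forgejo API for the configured repo.
forge_post() {
  local path="$1" payload="$2"
  curl -fsS -X POST \
    -H "Authorization: token ${FORGE_TOKEN}" \
    -H "Content-Type: application/json" \
    -d "$payload" \
    "${FORGE_URL}/api/v1/repos/${REPO}${path}"
}

# Low tier: submit an approving review, then merge the PR.
approve_and_merge() {
  local pr="$1"
  forge_post "/pulls/${pr}/reviews" \
    '{"event":"APPROVED","body":"auto-approved: low blast radius"}' &&
    forge_post "/pulls/${pr}/merge" '{"Do":"merge"}'
}
```

In this sketch the dispatcher would pipe each PR body through `tier_from_pr_body`, pass the result to `plan_for_tier`, and call `approve_and_merge` only when the plan is `approve-and-merge`.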
## Risks
- Admin token for auto-approve: the dispatcher needs an admin-level Forgejo token to approve
and merge PRs. Currently `FORGE_TOKEN` is used; branch protection has `admin_enforced: true`
which means even admin bots are blocked from bypassing the approval gate. This is the core
design fork: either (a) relax admin_enforced for low-tier PRs, or (b) use a separate
Forgejo "auto-approver" account with admin rights, or (c) bypass the PR workflow entirely
for low-tier actions (execute directly without a PR).
- Policy drift: as new formulas are added, policy.toml must be updated. If a formula is missing,
classify.sh should default to "high" (fail safe). How classify.sh currently handles an
unclassified formula has not been verified; this must be checked and hardened.
- Audit trail: low-tier auto-approvals should still leave a record. An auto-approve comment
("auto-approved: low blast radius") satisfies this.
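
The fail-safe default from the policy-drift risk could look roughly like this. It is a sketch under a loud assumption: the real layout of `policy.toml` is not shown here, so the example assumes flat `formula = "tier"` lines. `classify_or_high` is a hypothetical name, not the existing `classify.sh` interface.

```shell
#!/usr/bin/env bash
# Sketch only. Assumes policy.toml contains flat lines such as:
#   triage = "low"
# The real file layout may differ; classify_or_high is a hypothetical helper.

classify_or_high() {
  local policy_file="$1" formula="$2" tier=""
  if [ -r "$policy_file" ]; then
    # Find the first line whose key equals the formula name and print its
    # value with whitespace and double quotes stripped.
    tier="$(awk -F= -v key="$formula" '
      {
        k = $1; v = $2
        gsub(/[[:space:]]/, "", k)
        gsub(/[[:space:]"]/, "", v)
        if (k == key) { print v; exit }
      }
    ' "$policy_file")"
  fi
  # Only the three known tiers pass through. A missing file, a missing
  # formula, or a malformed value all fall back to "high".
  case "$tier" in
    low|medium|high) printf '%s\n' "$tier" ;;
    *)               printf 'high\n' ;;
  esac
}
```

With this shape, a formula that was never added to the policy file is held behind the admin gate instead of slipping through as low.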
## Cost — new infra to maintain
- One new Forgejo account or token (if auto-approver route chosen) — needs rotation policy
- `policy.toml` maintenance: every new formula must be classified before shipping
- No new services, cron jobs, or containers
## Recommendation
Worth it, but the design fork on the auto-approve mechanism must be resolved before implementation
begins. Resolving it is the purpose of the questions step.

The cleanest approach is option (c): bypass the PR workflow for low-tier actions entirely.
The dispatcher detects blast_radius=low, executes the formula immediately without creating
a PR, and writes to `vault/fired/` directly. This avoids the admin token problem, preserves
the PR workflow for medium/high, and keeps the audit trail in git. However, it changes the
blast_radius=low behavior from "PR exists but auto-merges" to "no PR, just executes" — operators
need to understand the difference.

The PR route (option b) is more visible but requires a dedicated account.
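
For option (c), the audit record written to `vault/fired/` might be produced along these lines. The record format, the file naming, and the `record_low_tier_fire` helper are all hypothetical; only the `vault/fired/` location and the "auto-approved: low blast radius" wording come from this document.

```shell
#!/usr/bin/env bash
# Sketch only. The record fields and the <action_id>.toml naming are
# assumptions for illustration, not the existing vault/fired/ format.

record_low_tier_fire() {
  local fired_dir="$1" action_id="$2" formula="$3"
  mkdir -p "$fired_dir" || return 1
  {
    printf 'formula = "%s"\n' "$formula"
    printf 'blast_radius = "low"\n'
    printf 'approval = "auto-approved: low blast radius"\n'
    printf 'fired_at = "%s"\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  } > "${fired_dir}/${action_id}.toml"
}
```

Committing the resulting file to the ops repo would keep the audit trail in git even though no PR is created.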