disinto-ops/sprints/vault-blast-radius-tiers.md

Sprint: vault blast-radius tiers

Vision issues

  • #419 — Vault: blast-radius based approval tiers

What this enables

After this sprint, low-tier vault actions execute without waiting for a human. The dispatcher auto-approves and merges vault PRs classified as low in policy.toml. Medium and high tiers keep a human in the loop: medium notifies and allows async review; high blocks until an admin approves.

This removes the bottleneck on low-risk bookkeeping operations while preserving the hard gate on production deploys, secret operations, and agent self-modification.

What exists today

The tier infrastructure is fully built. Only the enforcement is missing.

  • vault/policy.toml — Maps every formula to low/medium/high. Current low tier: groom-backlog, triage, reproduce, review-pr. Medium: dev, run-planner, run-gardener, run-predictor, run-supervisor, run-architect, upgrade-dependency. High: run-publish-site, run-rent-a-human, add-rpc-method, release.
  • vault/classify.sh — Shell classifier called by vault-env.sh. Returns tier for a given formula.
  • vault/SCHEMA.md — Documents blast_radius override field (string: "low"/"medium"/"high") that vault action TOMLs can use to override policy defaults.
  • vault/validate.sh — Validates vault action TOML fields including blast_radius.
  • docker/edge/dispatcher.sh — Edge dispatcher. Polls ops repo for merged vault PRs and executes them. Currently fires ALL merged vault PRs without tier differentiation.

What's missing: the dispatcher does not read blast_radius, does not auto-approve low-tier PRs, and does not differentiate notification behavior for medium vs high tier.
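
The missing tier differentiation could be sketched as a single mapping from tier to next step. This is a minimal illustration, not existing dispatcher.sh code: the function name tier_action and the action labels are hypothetical, and treating an unknown tier like high is an assumption in line with the fail-safe default discussed under Risks.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: map a blast-radius tier to the dispatcher's next step.
# The action labels are illustrative names, not existing dispatcher.sh functions.
tier_action() {
  local tier="$1"
  case "$tier" in
    low)    echo "auto-approve-and-merge" ;;        # no human wait
    medium) echo "comment-pending-async-review" ;;  # notify, review later
    high|*) echo "leave-pending" ;;                 # unknown tiers are treated as high (assumption)
  esac
}
```

Keeping the mapping in one function means the approval, comment, and wait paths can each be tested without a live Forgejo instance.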

Complexity

Files touched: 3

  • docker/edge/dispatcher.sh — read blast_radius from vault action TOML; for low tier, call Forgejo API to approve + merge the PR directly (admin token); for medium, post "pending async review" comment; for high, leave pending (existing behavior)
  • lib/vault.sh vault_request() — include blast_radius in the PR body so the dispatcher can read it without re-parsing the TOML
  • docs/VAULT.md — document the three-tier behavior for operators
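
One way the PR-body handoff between vault_request() and the dispatcher might look is sketched below. The marker line format ("blast_radius: <tier>") and both helper names are assumptions made for illustration; the source does not specify how the field is encoded in the PR body.

```shell
#!/usr/bin/env bash
# Assumed convention: vault_request() appends a "blast_radius: <tier>" line
# to the PR body. Both function names here are hypothetical.
format_pr_body() {
  local description="$1" tier="$2"
  printf '%s\n\nblast_radius: %s\n' "$description" "$tier"
}

# Extract the tier from a PR body. Prints "high" when the marker is
# missing or holds anything other than low/medium/high.
read_blast_radius() {
  local body="$1" tier
  tier=$(printf '%s\n' "$body" \
    | sed -nE 's/^blast_radius:[[:space:]]*(low|medium|high)[[:space:]]*$/\1/p' \
    | head -n 1)
  printf '%s\n' "${tier:-high}"
}
```

Because a PR body can be edited after creation, a dispatcher built this way would probably want to cross-check the parsed value against classify.sh before acting on a low result.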

Sub-issues: 3

Gluecode ratio: ~70% gluecode (dispatcher reads existing classify.sh output), ~30% new (auto-approve API call, comment logic)

Risks

  • Admin token for auto-approve: the dispatcher needs an admin-level Forgejo token to approve and merge PRs. It currently uses FORGE_TOKEN, and branch protection sets admin_enforced: true, which blocks even admin bots from bypassing the approval gate. This is the core design fork: (a) relax admin_enforced for low-tier PRs, (b) use a separate Forgejo "auto-approver" account with admin rights, or (c) bypass the PR workflow entirely for low-tier actions (execute directly without a PR).
  • Policy drift: as new formulas are added, policy.toml must be updated. If a formula is missing, classify.sh should default to "high" (fail safe). The current default behavior is unknown, so it needs to be verified and hardened.
  • Audit trail: low-tier auto-approvals should still leave a record. An auto-approve comment ("auto-approved: low blast radius") satisfies this.
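
The fail-safe default from the policy drift risk could be hardened roughly as follows. This is a sketch under a loud assumption: it treats policy.toml as flat lines of the form formula = "tier", which may not match the real file layout, and classify_formula is a hypothetical name rather than the actual interface of classify.sh.

```shell
#!/usr/bin/env bash
# Assumption: policy.toml holds flat lines like   triage = "low"
# The real layout may differ; adjust the lookup accordingly.
classify_formula() {
  local formula="$1" policy="${2:-vault/policy.toml}" tier=""
  if [ -r "$policy" ]; then
    # Find the first line whose key equals the formula; print its value
    # with whitespace and quotes stripped.
    tier=$(awk -F'=' -v f="$formula" '
      { key = $1; gsub(/[[:space:]"]/, "", key) }
      key == f { val = $2; gsub(/[[:space:]"]/, "", val); print val; exit }
    ' "$policy")
  fi
  case "$tier" in
    low|medium|high) printf '%s\n' "$tier" ;;
    *)               printf 'high\n' ;;  # missing, unreadable, or invalid: fail safe
  esac
}
```

With this shape, an unclassified formula, a typo in the tier value, and an unreadable policy file all resolve to high, so the failure mode is a blocked action rather than an unreviewed one.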

Cost — new infra to maintain

  • One new Forgejo account or token (if auto-approver route chosen) — needs rotation policy
  • policy.toml maintenance: every new formula must be classified before shipping
  • No new services, cron jobs, or containers

Recommendation

Worth it, but the design fork on the auto-approve mechanism must be resolved before implementation begins; this is the questions step.

The cleanest approach is option (c): bypass the PR workflow for low-tier actions entirely. The dispatcher detects blast_radius=low, executes the formula immediately without creating a PR, and writes to vault/fired/ directly. This avoids the admin token problem, preserves the PR workflow for medium/high, and keeps the audit trail in git. However, it changes low-tier behavior from "PR exists but auto-merges" to "no PR, just executes", so operators need to understand the difference.
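
A rough sketch of option (c) follows. Everything beyond the vault/fired/ directory is an assumption: run_formula stands in for however the dispatcher actually executes a formula, and the record file name and fields are invented for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the dispatcher's real formula execution.
run_formula() {
  echo "running $1"
}

# Execute a low-tier formula without a PR and leave a record in vault/fired/.
# Prints the record path on success; the record format is an assumption.
fire_low_tier() {
  local formula="$1" tier="$2" fired_dir="${3:-vault/fired}"
  local stamp record
  [ "$tier" = "low" ] || return 1          # medium/high stay on the PR workflow
  run_formula "$formula" >&2 || return 2   # only successful runs are recorded
  mkdir -p "$fired_dir"
  stamp=$(date -u +%Y%m%dT%H%M%SZ)
  record="$fired_dir/$stamp-$formula.txt"
  printf 'formula: %s\nblast_radius: low\napproval: auto (no PR)\nfired_at: %s\n' \
    "$formula" "$stamp" > "$record"
  printf '%s\n' "$record"
}
```

The sketch stops at writing the file. Committing and pushing the record, which is what would actually keep the audit trail in git, is left out.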

The PR route (option b) is more visible but requires a dedicated account.
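
For comparison, option (b) might look like the sketch below. It uses the Forgejo pull-request review and merge endpoints. AUTO_APPROVER_TOKEN, FORGE_URL, and DRY_RUN are hypothetical names introduced here, and the helper defaults to a dry run that only prints the requests it would send.

```shell
#!/usr/bin/env bash
# Sketch of option (b): approve and merge with a dedicated account's token.
# AUTO_APPROVER_TOKEN, FORGE_URL and DRY_RUN are hypothetical variable names.
forge_post() {
  local path="$1" json="$2"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf 'POST %s %s\n' "$path" "$json"   # print instead of calling the API
    return 0
  fi
  curl -fsS -X POST \
    -H "Authorization: token $AUTO_APPROVER_TOKEN" \
    -H 'Content-Type: application/json' \
    -d "$json" "$FORGE_URL/api/v1$path"
}

# Approve the PR with an audit comment, then merge it.
# repo is "owner/name"; pr is the pull request index.
auto_approve_and_merge() {
  local repo="$1" pr="$2"
  forge_post "/repos/$repo/pulls/$pr/reviews" \
    '{"event":"APPROVED","body":"auto-approved: low blast radius"}' || return 1
  forge_post "/repos/$repo/pulls/$pr/merge" '{"Do":"merge"}'
}
```

The review body doubles as the audit comment, so this route covers the audit-trail risk without extra work. Whether the merge succeeds still depends on how branch protection treats the auto-approver account.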