disinto-ops/sprints/supervisor-docker-storage.md

Sprint: supervisor Docker storage telemetry

Vision issues

  • #545 — supervisor should detect Docker btrfs subvolume usage explicitly, not rely solely on df -h /

What this enables

After this sprint, the supervisor knows where disk pressure comes from — not just that it exists. When disk usage hits 80%, the supervisor can distinguish "Docker images are bloated" from "build cache is huge" from "volumes are growing" from "btrfs metadata overhead" and take the right remediation action. Trend-aware journaling lets the supervisor detect patterns across runs ("this image rebuilt 12 times today, 8 GB dangling layers") and escalate proactively before P1 thresholds are crossed.
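
That per-category attribution comes straight out of docker system df. A minimal parsing sketch, assuming the documented Go-template fields (Type, Size, Reclaimable); it parses captured sample output here so it runs without a Docker daemon, and the sizes are illustrative:

```sh
# Sketch: per-category Docker disk attribution for the preflight report.
# The real source would be:
#   docker system df --format '{{.Type}}\t{{.Size}}\t{{.Reclaimable}}'
# Parsed from a captured sample so no Docker daemon is needed here.
tab=$(printf '\t')
sample=$(printf 'Images\t18.2GB\t9.1GB (50%%)\nContainers\t1.2GB\t820MB (66%%)\nLocal Volumes\t5.4GB\t2.1GB (38%%)\nBuild Cache\t7.9GB\t7.9GB')

# Emit one journal-friendly line per storage category.
printf '%s\n' "$sample" | while IFS="$tab" read -r type size reclaim; do
  printf '%-14s size=%-8s reclaimable=%s\n' "$type" "$size" "$reclaim"
done
```

Each output line maps directly to one of the "where is the pressure" categories above, so the journal entry can record all four numbers per run.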

What exists today

  • Supervisor runs every 20 min in the edge container via supervisor-run.sh -> preflight.sh -> Claude formula
  • Edge container has Docker socket (/var/run/docker.sock mount) and root access
  • docker-cli installed (Alpine apk add docker-cli)
  • Disk check: df -h / | awk 'NR==2{print $5}' — filesystem-level only, no Docker breakdown
  • P1 remediation: docker system prune -f, then docker system prune -a -f if usage is still >80%
  • Preflight framework is structured text with ## Section headers — easy to extend
  • Journal writing already appends per-run findings to daily markdown files
  • Vault filing for unresolved issues already in place

All infrastructure needed for this sprint is already built. This is pure extension of existing capabilities.
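
For reference, the existing filesystem-level check reduces to a few lines; a sketch of its numeric form (which the new threshold logic would branch on), using a captured df line so it runs anywhere:

```sh
# Sketch: numeric form of the existing disk check.
# preflight.sh currently uses: df -h / | awk 'NR==2{print $5}'
# A captured df output line stands in for the live call here.
df_line='/dev/sda1        98G   80G   14G  86% /'

# Strip the trailing % so the value can be compared numerically.
used_pct=$(printf '%s\n' "$df_line" | awk '{gsub(/%/, "", $5); print $5}')

if [ "$used_pct" -ge 80 ]; then
  echo "P1: root filesystem at ${used_pct}%; gather Docker breakdown before pruning"
fi
```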

Complexity

  • 3 files touched: supervisor/preflight.sh (new section), formulas/run-supervisor.toml (enhanced P1 logic), possibly a small helper
  • 3-4 sub-issues estimated
  • Gluecode ratio: ~90% — calling docker system df, parsing JSON output, adding conditional branches for storage driver
  • Greenfield: ~10% — btrfs-specific detection logic (if btrfs driver detected)
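
The storage-driver conditional is the gluecode/greenfield seam. A sketch, assuming the driver name comes from docker info --format '{{.Driver}}' (the function name and message strings are illustrative, not the final formula wording):

```sh
# Sketch: storage-driver-aware telemetry selection.
# In preflight, the driver would come from:
#   docker info --format '{{.Driver}}'
telemetry_plan() {
  case "$1" in
    btrfs)    echo "btrfs: also collect subvolume and metadata usage" ;;
    overlay2) echo "overlay2: docker system df breakdown only" ;;
    *)        echo "$1: unknown driver, df-only fallback" ;;
  esac
}

telemetry_plan btrfs   # driver value stubbed for illustration
```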

Risks

  • btrfs tools not in container image: btrfs filesystem df, compsize, etc. require extra packages (btrfs-progs and friends). May need an apk add in the Dockerfile, or graceful degradation when the tools are absent.
  • docker system df -v can be slow: On systems with hundreds of images, verbose output takes seconds. Preflight already runs in a time-bounded context, but worth watching.
  • btrfs CoW sharing makes size reporting unreliable: Apparent size != exclusive size. The sprint should report both where available, but document the caveat clearly.
  • Not all deployments use btrfs: The solution must work with overlay2 (the common case) and degrade gracefully — btrfs-specific telemetry is additive, not required.
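
The "graceful degradation" half of both btrfs risks is a one-line capability check. A sketch; only the POSIX command -v builtin is assumed, and have_tool is a hypothetical helper name:

```sh
# Sketch: degrade gracefully when btrfs-progs is not installed.
have_tool() { command -v "$1" >/dev/null 2>&1; }

if have_tool btrfs; then
  echo "btrfs-progs present: collecting btrfs filesystem df"
else
  echo "btrfs-progs absent: skipping btrfs telemetry (degraded mode)"
fi
```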

Cost — new infra to maintain

  • No new services, cron jobs, or containers — extends existing supervisor preflight
  • No new formulas — extends existing run-supervisor.toml
  • Minimal ongoing cost: if Docker changes docker system df output format (unlikely), the parser needs updating. Otherwise maintenance-free.
  • One conditional branch in remediation logic (storage-driver-aware cleanup) — adds ~20 lines to the formula
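
That ~20-line conditional could look roughly like the sketch below. Only the prune commands come from the existing P1 logic; pick_remediation, the thresholds' exact placement, and the btrfs rebalance step are illustrative assumptions, not the final formula:

```sh
# Sketch: storage-driver-aware remediation choice (illustrative).
pick_remediation() {
  driver="$1"
  used_pct="$2"
  if [ "$used_pct" -lt 80 ]; then
    echo "none"
  elif [ "$used_pct" -lt 90 ]; then
    echo "docker system prune -f"
  elif [ "$driver" = "btrfs" ]; then
    # On btrfs, freed space may not show up until data is rebalanced.
    echo "docker system prune -a -f && btrfs balance start -dusage=50 /"
  else
    echo "docker system prune -a -f"
  fi
}

pick_remediation overlay2 85
```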

Recommendation

Worth it. This sprint addresses a real production incident (harb-dev-box 98% disk with no visible culprit), the fix is low-risk gluecode extending well-tested infrastructure, and it adds zero ongoing maintenance burden. The information gap is the root cause of blind remediation — the supervisor currently prunes and hopes. With Docker storage telemetry, it can make informed decisions and escalate intelligently.

The btrfs-specific parts should degrade gracefully when tools are absent, making this useful on all storage drivers while providing extra resolution on btrfs deployments.