disinto-ops/sprints/supervisor-docker-storage.md

Sprint: supervisor Docker storage telemetry

Vision issues

  • #545 — supervisor should detect Docker btrfs subvolume usage explicitly, not rely solely on df -h /

What this enables

After this sprint, the supervisor knows where disk pressure comes from — not just that it exists. When disk usage hits 80%, the supervisor can distinguish "Docker images are bloated" from "build cache is huge" from "volumes are growing" from "btrfs metadata overhead" and take the right remediation action. Trend-aware journaling lets the supervisor detect patterns across runs ("this image rebuilt 12 times today, 8 GB dangling layers") and escalate proactively before P1 thresholds are crossed.
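
That per-category attribution comes straight out of docker system df. A minimal parsing sketch, assuming the documented Go-template fields (Type, Size, Reclaimable); it parses captured sample output here so it runs without a Docker daemon, and the sizes are illustrative:

```sh
# Sketch: per-category Docker disk attribution for the preflight report.
# The real source would be:
#   docker system df --format '{{.Type}}\t{{.Size}}\t{{.Reclaimable}}'
# Parsed from a captured sample so no Docker daemon is needed here.
tab=$(printf '\t')
sample=$(printf 'Images\t18.2GB\t9.1GB (50%%)\nContainers\t1.2GB\t820MB (66%%)\nLocal Volumes\t5.4GB\t2.1GB (38%%)\nBuild Cache\t7.9GB\t7.9GB')

# Emit one journal-friendly line per storage category.
printf '%s\n' "$sample" | while IFS="$tab" read -r type size reclaim; do
  printf '%-14s size=%-8s reclaimable=%s\n' "$type" "$size" "$reclaim"
done
```

Each output line maps directly to one of the "where is the pressure" categories above, so the journal entry can record all four numbers per run.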

What exists today

  • Supervisor runs every 20 min in the edge container via supervisor-run.sh -> preflight.sh -> Claude formula
  • Edge container has Docker socket (/var/run/docker.sock mount) and root access
  • docker-cli installed (Alpine apk add docker-cli)
  • Disk check: df -h / | awk 'NR==2{print $5}' — filesystem-level only, no Docker breakdown
  • P1 remediation: docker system prune -f, then docker system prune -a -f if usage is still >80%
  • Preflight framework is structured text with ## Section headers — easy to extend
  • Journal writing already appends per-run findings to daily markdown files
  • Vault filing for unresolved issues already in place

All infrastructure needed for this sprint is already built. This is pure extension of existing capabilities.
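
For reference, the existing filesystem-level check reduces to a few lines; a sketch of its numeric form (which the new threshold logic would branch on), using a captured df line so it runs anywhere:

```sh
# Sketch: numeric form of the existing disk check.
# preflight.sh currently uses: df -h / | awk 'NR==2{print $5}'
# A captured df output line stands in for the live call here.
df_line='/dev/sda1        98G   80G   14G  86% /'

# Strip the trailing % so the value can be compared numerically.
used_pct=$(printf '%s\n' "$df_line" | awk '{gsub(/%/, "", $5); print $5}')

if [ "$used_pct" -ge 80 ]; then
  echo "P1: root filesystem at ${used_pct}%; gather Docker breakdown before pruning"
fi
```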

Complexity

  • 3 files touched: supervisor/preflight.sh (new section), formulas/run-supervisor.toml (enhanced P1 logic), possibly a small helper
  • 3-4 sub-issues estimated
  • Gluecode ratio: ~90% — calling docker system df, parsing JSON output, adding conditional branches for storage driver
  • Greenfield: ~10% — btrfs-specific detection logic (if btrfs driver detected)
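
The storage-driver conditional is the gluecode/greenfield seam. A sketch, assuming the driver name comes from docker info --format '{{.Driver}}' (the function name and message strings are illustrative, not the final formula wording):

```sh
# Sketch: storage-driver-aware telemetry selection.
# In preflight, the driver would come from:
#   docker info --format '{{.Driver}}'
telemetry_plan() {
  case "$1" in
    btrfs)    echo "btrfs: also collect subvolume and metadata usage" ;;
    overlay2) echo "overlay2: docker system df breakdown only" ;;
    *)        echo "$1: unknown driver, df-only fallback" ;;
  esac
}

telemetry_plan btrfs   # driver value stubbed for illustration
```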

Risks

  • btrfs tools not in container image: btrfs filesystem df, compsize, etc. require extra packages (btrfs-progs and friends). May need an apk add in the Dockerfile, or graceful degradation when the tools are absent.
  • docker system df -v can be slow: On systems with hundreds of images, verbose output takes seconds. Preflight already runs in a time-bounded context, but worth watching.
  • btrfs CoW sharing makes size reporting unreliable: Apparent size != exclusive size. The sprint should report both where available, but document the caveat clearly.
  • Not all deployments use btrfs: The solution must work with overlay2 (the common case) and degrade gracefully — btrfs-specific telemetry is additive, not required.
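
The "graceful degradation" half of both btrfs risks is a one-line capability check. A sketch; only the POSIX command -v builtin is assumed, and have_tool is a hypothetical helper name:

```sh
# Sketch: degrade gracefully when btrfs-progs is not installed.
have_tool() { command -v "$1" >/dev/null 2>&1; }

if have_tool btrfs; then
  echo "btrfs-progs present: collecting btrfs filesystem df"
else
  echo "btrfs-progs absent: skipping btrfs telemetry (degraded mode)"
fi
```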

Cost — new infra to maintain

  • No new services, cron jobs, or containers — extends existing supervisor preflight
  • No new formulas — extends existing run-supervisor.toml
  • Minimal ongoing cost: if Docker changes docker system df output format (unlikely), the parser needs updating. Otherwise maintenance-free.
  • One conditional branch in remediation logic (storage-driver-aware cleanup) — adds ~20 lines to the formula
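
That ~20-line conditional could look roughly like the sketch below. Only the prune commands come from the existing P1 logic; pick_remediation, the thresholds' exact placement, and the btrfs rebalance step are illustrative assumptions, not the final formula:

```sh
# Sketch: storage-driver-aware remediation choice (illustrative).
pick_remediation() {
  driver="$1"
  used_pct="$2"
  if [ "$used_pct" -lt 80 ]; then
    echo "none"
  elif [ "$used_pct" -lt 90 ]; then
    echo "docker system prune -f"
  elif [ "$driver" = "btrfs" ]; then
    # On btrfs, freed space may not show up until data is rebalanced.
    echo "docker system prune -a -f && btrfs balance start -dusage=50 /"
  else
    echo "docker system prune -a -f"
  fi
}

pick_remediation overlay2 85
```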

Recommendation

Worth it. This sprint addresses a real production incident (harb-dev-box 98% disk with no visible culprit), the fix is low-risk gluecode extending well-tested infrastructure, and it adds zero ongoing maintenance burden. The information gap is the root cause of blind remediation — the supervisor currently prunes and hopes. With Docker storage telemetry, it can make informed decisions and escalate intelligently.

The btrfs-specific parts should degrade gracefully when tools are absent, making this useful on all storage drivers while providing extra resolution on btrfs deployments.