architect: supervisor Docker storage telemetry #34

Merged

disinto-admin merged 1 commit from architect/supervisor-docker-storage into main

2026-04-15 17:39:20 +00:00

architect-bot commented

2026-04-15 09:13:29 +00:00

Collaborator

Sprint pitch: supervisor Docker storage telemetry

Vision issue: #545 — supervisor should detect Docker btrfs subvolume usage explicitly

What this enables

The supervisor learns where disk pressure comes from, not just that it exists. Instead of running docker system prune blindly and hoping it helps, it can tell Docker images, build cache, volumes, and btrfs metadata overhead apart and pick the matching remediation. Journaling the breakdown on every run also makes trends visible across runs, so the supervisor can escalate before the disk fills.
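
To make the "trend across runs" idea concrete, here is a minimal sketch of one possible check, not the supervisor's actual logic. It assumes the journal can be read back as (unix_seconds, used_bytes) samples, oldest first; the function name, the sample shape, and the first-to-last slope are all hypothetical choices for illustration.

```python
def hours_until_full(samples, capacity_bytes):
    """Project hours until the disk is full from journaled usage samples.

    samples: list of (unix_seconds, used_bytes) tuples, oldest first.
    Returns None when there is too little data or usage is not growing.
    """
    if len(samples) < 2:
        return None
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    if t1 <= t0 or used1 <= used0:
        return None  # flat or shrinking usage: nothing to escalate
    rate = (used1 - used0) / (t1 - t0)  # bytes per second
    return (capacity_bytes - used1) / rate / 3600
```

A supervisor run could compare the result against a threshold (for example, escalate when fewer than some number of hours remain). The threshold and the escalation path are left open here.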

What exists today

Supervisor runs every 20 min in edge container with Docker socket + root access
Disk check is df -h / only — no Docker breakdown
P1 remediation: docker system prune -f, then docker system prune -a -f
Preflight framework, journal writing, and vault filing all ready to extend

Complexity

3 files touched, 3-4 sub-issues, ~90% glue code
Calls docker system df, parses output, adds storage-driver-aware branching
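
A rough sketch of what the parse-and-branch step could look like, assuming the preflight captures the output of docker system df --format '{{json .}}' (one JSON object per line with Type, Size, and Reclaimable fields) and the storage driver from docker info --format '{{.Driver}}'. The function names and the returned dictionary shape are hypothetical, and the unit table assumes the decimal suffixes the Docker CLI normally prints.

```python
import json

# Decimal size suffixes as typically printed by the Docker CLI (assumption).
UNITS = {"B": 1, "kB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}


def parse_size(text):
    """Convert a size such as '2.4GB' or '1.2GB (50%)' to bytes."""
    token = text.split()[0]  # drop a trailing '(50%)' if present
    for suffix in sorted(UNITS, key=len, reverse=True):  # try 'kB' before 'B'
        if token.endswith(suffix):
            return round(float(token[: -len(suffix)]) * UNITS[suffix])
    raise ValueError(f"unrecognised size: {text!r}")


def parse_system_df(output):
    """Map each resource type to its total and reclaimable bytes."""
    breakdown = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        row = json.loads(line)
        breakdown[row["Type"]] = {
            "size": parse_size(row["Size"]),
            "reclaimable": parse_size(row["Reclaimable"]),
        }
    return breakdown


def pick_target(breakdown, driver):
    """Hypothetical branching: name the type with the most reclaimable space."""
    if not breakdown:
        return None
    target = max(breakdown, key=lambda kind: breakdown[kind]["reclaimable"])
    # On btrfs, CoW sharing makes these figures approximate; flag that for the journal.
    return {"target": target, "sizes_approximate": driver == "btrfs"}
```

Keeping the parsing separate from the docker call means it can be exercised with canned output, without a Docker socket.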

Risks

btrfs tools may not be in container image (graceful degradation needed)
docker system df -v can be slow on large systems
btrfs CoW makes size reporting unreliable (document caveat)
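
The first two risks could be handled by one small wrapper. This is a sketch under assumptions, not the planned implementation: run_optional is a hypothetical helper name, and the default timeout is an arbitrary placeholder.

```python
import shutil
import subprocess


def run_optional(cmd, timeout_s=30.0):
    """Run cmd and return its stdout, or None if it cannot produce a result.

    None covers three cases: the binary is not installed, the command
    exits non-zero, or it runs longer than timeout_s seconds.
    """
    if shutil.which(cmd[0]) is None:
        return None  # tool missing from the container image: degrade gracefully
    try:
        done = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s, check=True
        )
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError):
        return None
    return done.stdout
```

For example, run_optional(["btrfs", "filesystem", "usage", "-b", "/var/lib/docker"]) would return None on an image without btrfs-progs, and run_optional(["docker", "system", "df", "-v"], timeout_s=60) would give up on a slow host instead of stalling the run. The /var/lib/docker path is Docker's default data root and is assumed here.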

Cost

No new services, cron jobs, or containers
Extends existing preflight + formula — zero new infrastructure

Recommendation

Worth it. Addresses a real production incident (harb-dev-box at 98% disk usage with 42 GB unaccounted for). Low-risk glue code, zero maintenance burden, makes supervisor remediation informed instead of blind.

Reply ACCEPT to proceed with design questions, or REJECT: <reason> to decline.

architect-bot added 1 commit 2026-04-15 09:13:29 +00:00