architect: supervisor Docker storage telemetry #34

Merged
disinto-admin merged 1 commit from architect/supervisor-docker-storage into main 2026-04-15 17:39:20 +00:00

Sprint pitch: supervisor Docker storage telemetry

Vision issue: #545 — supervisor should detect Docker btrfs subvolume usage explicitly

What this enables

The supervisor learns where disk pressure comes from, not just that it exists. Instead of running docker system prune blindly and hoping it helps, it can attribute usage to Docker images, build cache, volumes, or btrfs metadata overhead, and pick the matching remediation. Trend-aware journaling compares readings across runs so the supervisor can escalate before the disk fills.
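
As an illustration of what trend-aware escalation could look like, here is a minimal sketch. It assumes the journal already holds one disk-usage percentage per supervisor run; the function name, the 95% limit, and the simple linear extrapolation are all hypothetical choices, not part of the pitch.

```python
from typing import List, Optional


def runs_until_full(used_pct_history: List[float], limit_pct: float = 95.0) -> Optional[float]:
    """Estimate how many more runs pass before usage reaches limit_pct.

    Uses the average growth per run between the oldest and newest sample.
    Returns None when there are fewer than two samples or usage is not growing.
    """
    if len(used_pct_history) < 2:
        return None
    growth_per_run = (used_pct_history[-1] - used_pct_history[0]) / (len(used_pct_history) - 1)
    if growth_per_run <= 0:
        return None
    remaining = limit_pct - used_pct_history[-1]
    # Already at or past the limit: zero runs left.
    return max(remaining, 0.0) / growth_per_run
```

With a run every 20 minutes, a small return value would be a natural trigger for escalating before the disk actually fills; the threshold for "small" is left open here.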

What exists today

  • Supervisor runs every 20 min in edge container with Docker socket + root access
  • Disk check is df -h / only — no Docker breakdown
  • P1 remediation: docker system prune -f, then docker system prune -a -f
  • Preflight framework, journal writing, and vault filing all ready to extend

Complexity

  • 3 files touched, 3-4 sub-issues, ~90% glue code
  • Calls docker system df, parses output, adds storage-driver-aware branching
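
The parse-and-branch step could be sketched roughly as follows. This is not the supervisor's actual code: it assumes the JSON form of the command (docker system df --format '{{json .}}', one object per line) with Type, Size, and Reclaimable fields, which should be checked against the installed Docker CLI, and it assumes Docker's default data root /var/lib/docker. All function names are hypothetical.

```python
import json
import subprocess
from typing import Dict, List


def parse_system_df(json_lines: str) -> Dict[str, Dict[str, str]]:
    """Parse `docker system df --format '{{json .}}'` output, keyed by usage type."""
    usage: Dict[str, Dict[str, str]] = {}
    for line in json_lines.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        row = json.loads(line)
        usage[row["Type"]] = {"size": row["Size"], "reclaimable": row["Reclaimable"]}
    return usage


def extra_commands_for(driver: str) -> List[List[str]]:
    """Return additional diagnostic commands for a given storage driver."""
    if driver == "btrfs":
        # Assumption: Docker's data root is the default /var/lib/docker.
        return [["btrfs", "filesystem", "usage", "/var/lib/docker"]]
    return []


def storage_driver() -> str:
    """Ask the Docker daemon which storage driver it uses."""
    result = subprocess.run(
        ["docker", "info", "--format", "{{.Driver}}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

Keeping the parser separate from the subprocess call means it can be tested against captured output without a Docker daemon.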

Risks

  • btrfs tools may not be in container image (graceful degradation needed)
  • docker system df -v can be slow on large systems
  • btrfs copy-on-write (CoW) shares extents between subvolumes, so per-subvolume size figures are unreliable (document the caveat)
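
The first two risks could be handled with one small wrapper, sketched below under stated assumptions: the helper name and the 30-second default are hypothetical, and returning None is just one way to signal "no data, skip this probe" to the caller.

```python
import shutil
import subprocess
from typing import List, Optional


def run_with_timeout(cmd: List[str], timeout_s: float = 30.0) -> Optional[str]:
    """Run cmd and return its stdout.

    Returns None if the tool is not installed, the call exceeds timeout_s,
    or the command exits non-zero, so callers can degrade gracefully.
    """
    if shutil.which(cmd[0]) is None:
        return None  # e.g. btrfs tools missing from the container image
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s, check=False
        )
    except (subprocess.TimeoutExpired, OSError):
        return None  # e.g. docker system df -v taking too long
    return result.stdout if result.returncode == 0 else None
```

A caller would treat None from run_with_timeout(["btrfs", "filesystem", "usage", "/var/lib/docker"]) as "btrfs detail unavailable" and fall back to the plain docker system df numbers.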

Cost

  • No new services, cron jobs, or containers
  • Extends existing preflight + formula — zero new infrastructure

Recommendation

Worth it. Addresses a real production incident (harb-dev-box at 98% disk usage, 42 GB unaccounted for). Low-risk glue code with zero maintenance burden; it makes supervisor remediation informed instead of blind.


Reply ACCEPT to proceed with design questions, or REJECT: <reason> to decline.

architect-bot added 1 commit 2026-04-15 09:13:29 +00:00
disinto-admin merged commit bb37eaf588 into main 2026-04-15 17:39:20 +00:00
Reference: disinto-admin/disinto-ops#34