architect: supervisor Docker storage telemetry #34
1 changed file with 48 additions and 0 deletions
sprints/supervisor-docker-storage.md
@@ -0,0 +1,48 @@
# Sprint: supervisor Docker storage telemetry

## Vision issues
- #545 — supervisor should detect Docker btrfs subvolume usage explicitly, not rely solely on `df -h /`

## What this enables

After this sprint, the supervisor knows *where* disk pressure comes from, not just that it exists. When disk usage reaches 80%, it can distinguish between bloated Docker images, an oversized build cache, growing volumes, and btrfs metadata overhead, and choose the matching remediation. Trend-aware journaling lets it detect patterns across runs (for example, "this image rebuilt 12 times today, 8 GB dangling layers") and escalate before P1 thresholds are crossed.
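
The distinction above could be sketched roughly as follows. This is a minimal illustration, not the supervisor's actual logic: the helper names `dominant_source` and `cleanup_command`, the tab-separated `category<TAB>bytes` input, and the mapping from category to action are all assumptions.

```sh
# Hypothetical helper: read "category<TAB>bytes" lines on stdin and print
# the category that accounts for the most bytes ("unknown" if none).
dominant_source() {
    awk -F '\t' '
        $2 + 0 > max { max = $2 + 0; name = $1 }
        END { if (name != "") print name; else print "unknown" }
    '
}

# Hypothetical mapping from category to a remediation. It only prints the
# suggested action; nothing is executed.
cleanup_command() {
    case "$1" in
        build-cache) echo "docker builder prune -f" ;;
        images)      echo "docker image prune -f" ;;
        volumes)     echo "escalate: volumes need manual review" ;;
        *)           echo "docker system prune -f" ;;
    esac
}

# Made-up numbers for illustration, not real measurements.
src=$(printf 'images\t8000000000\nbuild-cache\t21000000000\nvolumes\t3000000000\n' | dominant_source)
echo "$src: $(cleanup_command "$src")"
```

In this sketch, volumes are routed to escalation rather than pruned, because removing a volume deletes data.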

## What exists today

- **Supervisor runs every 20 min** in the edge container via `supervisor-run.sh` -> `preflight.sh` -> Claude formula
- **Edge container has Docker socket** (`/var/run/docker.sock` mount) and root access
- **docker-cli installed** (Alpine `apk add docker-cli`)
- **Disk check**: `df -h / | awk 'NR==2{print $5}'` — filesystem-level only, no Docker breakdown
- **P1 remediation**: `docker system prune -f`, then `docker system prune -a -f` if usage is still above 80%
- **Preflight framework** is structured text with `## Section` headers — easy to extend
- **Journal writing** already appends per-run findings to daily markdown files
- **Vault filing** for unresolved issues is already in place

All infrastructure needed for this sprint is already built. This is a pure extension of existing capabilities.
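
As a rough sketch of how the preflight could be extended, the function below emits one more `## Section` block with the storage driver and the per-category summary from `docker system df`. The function name `docker_storage_section` and the exact wording of its output are assumptions; the real `preflight.sh` layout is not shown in this document.

```sh
# Sketch of an additional preflight section. Assumes preflight output is
# plain text with "## Section" headers, as described above.
docker_storage_section() {
    echo "## Docker storage"
    if ! command -v docker >/dev/null 2>&1; then
        echo "docker CLI not found; skipping"
        return 0
    fi
    # Storage driver in use (for example overlay2 or btrfs).
    echo "driver: $(docker info --format '{{.Driver}}' 2>/dev/null || echo unknown)"
    # Per-category summary: images, containers, local volumes, build cache.
    docker system df 2>/dev/null || echo "docker system df failed"
}
```

Calling `docker_storage_section` from the preflight script would append the block to the existing output. Every failure path prints a message instead of aborting, so the rest of the preflight still runs.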

## Complexity

- **3 files touched**: `supervisor/preflight.sh` (new section), `formulas/run-supervisor.toml` (enhanced P1 logic), possibly a small helper
- **3-4 sub-issues** estimated
- **Gluecode ratio: ~90%** — calling `docker system df`, parsing JSON output, adding conditional branches for storage driver
- **Greenfield: ~10%** — btrfs-specific detection logic (if btrfs driver detected)
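
Most of the gluecode is turning Docker's human-readable sizes into numbers that can be compared. The sketch below is one possible approach, not the planned parser. The list above mentions parsing JSON, but this sketch converts size strings such as `1.5GB (40%)` with `awk` instead, because no JSON parser is listed among the installed tools. The helper name `to_bytes` and the assumption of decimal (power-of-1000) units are mine and should be checked against the deployed Docker version.

```sh
# Hypothetical helper: convert a size string such as "1.5GB" or
# "512kB (10%)" to bytes. Only the first whitespace-separated field is
# used. Assumes decimal units (1 kB = 1000 B).
to_bytes() {
    printf '%s\n' "$1" | awk '{
        n = $1 + 0                     # leading numeric part
        u = $1; gsub(/[0-9.]/, "", u)  # remaining unit suffix
        m = 1
        if (u == "kB" || u == "KB") m = 1e3
        else if (u == "MB") m = 1e6
        else if (u == "GB") m = 1e9
        else if (u == "TB") m = 1e12
        printf "%.0f\n", n * m
    }'
}

to_bytes "1.5GB (40%)"   # prints 1500000000
```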

## Risks

- **btrfs tools not in container image**: `btrfs filesystem df` requires the `btrfs-progs` package, and `compsize` ships as a separate package. Either may need an `apk add` in the Dockerfile, or the preflight must degrade gracefully when the tools are absent.
- **`docker system df -v` can be slow**: On systems with hundreds of images, verbose output takes seconds. Preflight already runs in a time-bounded context, but worth watching.
- **btrfs CoW sharing makes size reporting unreliable**: Apparent size != exclusive size. The sprint should report both where available, but document the caveat clearly.
- **Not all deployments use btrfs**: The solution must work with overlay2 (the common case) and degrade gracefully — btrfs-specific telemetry is additive, not required.
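
Two of the risks above (missing tools and slow commands) can be handled by one small wrapper. The following is a sketch under assumptions: `run_optional` is a hypothetical name, and it relies on a `timeout` command that accepts `timeout SECONDS COMMAND` (GNU coreutils and recent BusyBox builds do; older BusyBox versions used a different syntax).

```sh
# Run a command only if its binary exists, with a time limit.
# Usage: run_optional <seconds> <command> [args...]
run_optional() {
    _limit=$1; shift
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "$1 not installed; skipping"
        return 0
    fi
    timeout "$_limit" "$@" || echo "$1 failed or timed out"
}

# Hypothetical use in a btrfs-only branch (/var/lib/docker is Docker's
# default data root and may differ on a given host):
# run_optional 10 btrfs filesystem df /var/lib/docker
```

Because the wrapper always returns success and prints a one-line reason, an overlay2 host without `btrfs-progs` simply gets a "skipping" line in its preflight output.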

## Cost: new infra to maintain

- **No new services, cron jobs, or containers** — extends existing supervisor preflight
- **No new formulas** — extends existing `run-supervisor.toml`
- **Minimal ongoing cost**: if Docker changes `docker system df` output format (unlikely), the parser needs updating. Otherwise maintenance-free.
- **One conditional branch** in remediation logic (storage-driver-aware cleanup) — adds ~20 lines to the formula

## Recommendation

**Worth it.** This sprint addresses a real production incident (harb-dev-box at 98% disk with no visible culprit). The fix is low-risk gluecode that extends well-tested infrastructure and adds almost no ongoing maintenance burden. The information gap is the root cause of blind remediation: the supervisor currently prunes and hopes. With Docker storage telemetry, it can make informed decisions and escalate intelligently.

The btrfs-specific parts should degrade gracefully when tools are absent, so the telemetry is useful on all storage drivers and provides extra resolution on btrfs deployments.