supervisor: every run silently exits at formula_worktree_setup — FORGE_REMOTE unbound #1120
Reference: disinto-admin/disinto#1120
Symptom
`supervisor-run.sh` has been effectively dead since 2026-04-19T19:03Z (roughly 48h at time of filing). The polling loop in `docker/agents/entrypoint.sh` invokes it every 20 min, and the script writes `--- Supervisor run start ---` to `data/logs/supervisor/supervisor.log`, but no log lines follow: no preflight, no journal entries, no `--- Supervisor run done ---`. ~200 iterations in a row show this pattern.

Effect: WP-agent health recovery is off (the CI-exhausted sweep in `supervisor-run.sh` never runs), worktree GC is off, journal writes are off, and scratch-file compaction state never persists.

Reproduction
Inside `disinto-agents`:

The entrypoint redirects the supervisor invocation's stderr to `data/logs/supervisor.log` (singular, not the agent-written `supervisor/supervisor.log`). That sink holds the real error.

Diagnosis
`lib/formula-session.sh` exposes two functions:
- `resolve_forge_remote()` at lines 86–98: walks `git remote -v` to set `FORGE_REMOTE`, exports it, falls back to `origin` if no match.
- `formula_worktree_setup()` at lines 794–804: uses `${FORGE_REMOTE}` in `git fetch` and `git worktree add`. Its docstring at lines 795–796 states:

`supervisor-run.sh` never calls `resolve_forge_remote`. It jumps straight from `build_sdk_prompt_footer` to `formula_worktree_setup "$WORKTREE"`. With `set -u` active, the first `${FORGE_REMOTE}` expansion aborts the script with exit code 1 before any further `log()` call. The abort happens silently because the outer entrypoint backgrounds the process (`&`) and only captures stderr to a separate log that nobody polls.

Fix sketch
Two candidates:
1. One-line call site fix (smallest): add `resolve_forge_remote` in `supervisor-run.sh` right after the `cd "$PROJECT_REPO_ROOT"` step and before `formula_worktree_setup "$WORKTREE"`. This mirrors the pattern used by the dev-poll / review-poll loops.
2. Defensive fix in `formula_worktree_setup`: call `resolve_forge_remote` inside `formula_worktree_setup` itself if `FORGE_REMOTE` is unset, so every caller is safe. Slightly larger blast radius, but it eliminates the precondition footgun entirely.

Option 2 is preferable: the silent abort on the unbound variable was invisible for two days, and adding a precondition comment on a function is weaker than making the function self-heal.
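
A minimal sketch of the defensive guard, under stated assumptions: both functions below are hypothetical stand-ins, not the real bodies from `lib/formula-session.sh`. The stub `resolve_forge_remote` only models the documented fallback to `origin`, and the stub `formula_worktree_setup` echoes instead of running `git fetch` / `git worktree add`. The point is the `${FORGE_REMOTE:-}` test, which is safe under `set -u`, contrasted with the bare expansion that aborts.

```shell
#!/bin/sh
set -eu

# Hypothetical stand-in for resolve_forge_remote(). The real function walks
# `git remote -v`; this stub only models the fallback to "origin".
resolve_forge_remote() {
  FORGE_REMOTE="origin"
  export FORGE_REMOTE
}

# Hypothetical stand-in for formula_worktree_setup() with the proposed guard.
formula_worktree_setup() {
  worktree="$1"
  # ${FORGE_REMOTE:-} expands to an empty string instead of tripping `set -u`.
  if [ -z "${FORGE_REMOTE:-}" ]; then
    resolve_forge_remote
  fi
  # The real function would fetch and add a worktree here; echo keeps the
  # sketch free of side effects.
  echo "would fetch from ${FORGE_REMOTE} into ${worktree}"
}

# Without the guard: a bare expansion of an unset variable aborts a
# `set -u` shell. The subshell keeps that abort from killing this script.
unguarded_status=0
( set -u; unset FORGE_REMOTE; : "${FORGE_REMOTE}" ) 2>/dev/null || unguarded_status=$?

# With the guard: the same unset starting state resolves itself.
unset FORGE_REMOTE
guarded_output="$(formula_worktree_setup /tmp/example-worktree)"

echo "unguarded exit status: ${unguarded_status}"
echo "${guarded_output}"
```

The unguarded subshell also discards its stderr, which loosely mirrors how the real error ended up in a sink nobody polls. If the guard is adopted, the call-site fix from option 1 becomes redundant but harmless.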
Acceptance
- `data/logs/supervisor/supervisor.log` shows `--- Supervisor run start ---` followed by `Running preflight.sh` and eventually `--- Supervisor run done ---` on every iteration.
- `data/logs/supervisor.log` (the stderr sink) does NOT contain `FORGE_REMOTE: unbound variable` entries going forward.
- `WP agent restart and issue recovery complete` or `WP Agent Health: healthy` path).

Related
`/tmp/{architect,gardener,planner,predictor}-run.lock` from the 2026-04-21T11:58Z 6h tick: not released, PIDs dead. May be a related-but-separate bug (agents crashing without `trap EXIT` releasing the lock). Worth a follow-up issue if it reproduces after the supervisor fix lands.
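
For the follow-up, a minimal sketch of the `trap EXIT` pattern mentioned above. Everything here is illustrative: `run_agent`, the temporary lock path, and the simulated failure are hypothetical and are not taken from the agent scripts. It only shows that a lock written by a process that installs an EXIT trap is removed even when the run aborts partway through.

```shell
#!/bin/sh
set -eu

# Hypothetical lock location; the locks in this issue live at
# /tmp/<agent>-run.lock. A private temp dir keeps the sketch isolated.
LOCK_DIR="$(mktemp -d)"
LOCK_FILE="${LOCK_DIR}/example-run.lock"

# Hypothetical agent run. The subshell models one agent process, so the
# EXIT trap is scoped to that process only.
run_agent() {
  (
    echo "$$" > "$LOCK_FILE"
    # Release the lock on any exit path, normal or failing.
    trap 'rm -f "$LOCK_FILE"' EXIT
    # Simulated failure: models an abort partway through the run.
    exit 1
  )
}

agent_status=0
run_agent || agent_status=$?

if [ -e "$LOCK_FILE" ]; then
  lock_state="stale"
else
  lock_state="released"
fi

echo "agent exit status: ${agent_status}, lock: ${lock_state}"
rm -rf "$LOCK_DIR"
```

An EXIT trap cannot run if the process is killed with SIGKILL, so a liveness check on the PID stored in the lock may still be needed alongside it.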