supervisor: every run silently exits at formula_worktree_setup — FORGE_REMOTE unbound #1120

Open
opened 2026-04-21 13:09:59 +00:00 by dev-bot · 0 comments
Collaborator

Symptom

supervisor-run.sh has been effectively dead since 2026-04-19T19:03Z (roughly 48h at time of filing). The polling loop in docker/agents/entrypoint.sh invokes it every 20 min and the script writes --- Supervisor run start --- to data/logs/supervisor/supervisor.log, but no further log lines follow: no preflight, no journal entries, no --- Supervisor run done ---. ~200 iterations in a row show this pattern.

Effect: WP-agent health recovery is off (CI-exhausted sweep in supervisor-run.sh never runs), worktree GC is off, journal writes are off, scratch-file compaction state never persists.

Reproduction

Inside disinto-agents:

docker exec disinto-agents bash -c 'tail -3 /home/agent/data/logs/supervisor.log'
# → /home/agent/repos/_factory/lib/formula-session.sh: line 800: FORGE_REMOTE: unbound variable

The entrypoint redirects the supervisor invocation's stderr to data/logs/supervisor.log (singular, not the agent-written supervisor/supervisor.log). That sink holds the real error.

Diagnosis

lib/formula-session.sh exposes two functions:

  • resolve_forge_remote() at lines 86–98: walks git remote -v to set FORGE_REMOTE, exports it, falls back to origin if no match.
  • formula_worktree_setup() at lines 794–804: uses ${FORGE_REMOTE} in git fetch and git worktree add. Its docstring at lines 795–796 states:

    Requires globals: PROJECT_REPO_ROOT, PRIMARY_BRANCH, FORGE_REMOTE.
    Ensure resolve_forge_remote() is called before this function.

supervisor-run.sh never calls resolve_forge_remote. It jumps straight from build_sdk_prompt_footer to formula_worktree_setup "$WORKTREE". With set -u active, the first ${FORGE_REMOTE} expansion aborts the script with exit code 1 before any further log() call. The abort happens silently because the outer entrypoint backgrounds the process (&) and only captures stderr to a separate log that nobody polls.
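
The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the actual supervisor code: demo_worktree_setup and the shortened marker strings are hypothetical stand-ins, and the child bash exists only so the abort does not take down the calling shell.

```shell
#!/usr/bin/env bash
# Run a miniature "supervisor" in a child bash so its abort cannot kill this shell.
# demo_worktree_setup and the marker strings are illustrative, not from the repo.
demo_output=$(bash -c '
  set -u
  unset FORGE_REMOTE                       # simulate: resolve_forge_remote was never called
  demo_worktree_setup() {
    echo "fetching from ${FORGE_REMOTE}"   # unbound expansion: the shell aborts here
  }
  echo "--- run start ---"
  demo_worktree_setup
  echo "--- run done ---"                  # never reached
' 2>/dev/null) && demo_status=0 || demo_status=$?

echo "stdout: ${demo_output}"
echo "exit status: ${demo_status}"
```

Only the start marker reaches stdout and the child exits non-zero, which matches the pattern seen in supervisor/supervisor.log. The error message itself goes to stderr, discarded here and redirected to the separate sink in the real setup.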

Fix sketch

Two candidates:

  1. One-line call site fix (smallest, and matches how the dev-poll / review-poll loops handle it): add resolve_forge_remote in supervisor-run.sh right after the cd "$PROJECT_REPO_ROOT" step and before formula_worktree_setup "$WORKTREE".

  2. Defensive fix in formula_worktree_setup — call resolve_forge_remote inside formula_worktree_setup itself if FORGE_REMOTE is unset, so every caller is safe. Slightly larger blast radius but eliminates the precondition footgun entirely.

Option 2 is preferable: the silent abort on the unbound variable went unnoticed for two days, and relying on a precondition comment in a docstring is weaker than making the function self-heal.
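
A rough sketch of what option 2 could look like, under stated assumptions: both function bodies below are simplified stand-ins (the real resolve_forge_remote walks git remote -v, and the real formula_worktree_setup runs git fetch and git worktree add), and the branch name main and the worktree path are placeholders. The guard is the only part meant to carry over.

```shell
#!/usr/bin/env bash
# Stand-in for the real resolve_forge_remote, which walks `git remote -v`.
# Here it only applies the documented fallback to "origin".
resolve_forge_remote() {
  FORGE_REMOTE="origin"
  export FORGE_REMOTE
}

# Stand-in for formula_worktree_setup with the proposed self-healing guard.
formula_worktree_setup() {
  local worktree="$1"
  # ${FORGE_REMOTE:-} expands to an empty string instead of aborting under set -u,
  # so the check itself is safe even when the variable was never set.
  if [ -z "${FORGE_REMOTE:-}" ]; then
    resolve_forge_remote
  fi
  # The real function would run git fetch / git worktree add here; this only reports.
  echo "would fetch ${FORGE_REMOTE}/${PRIMARY_BRANCH} into ${worktree}"
}

PRIMARY_BRANCH="main"   # placeholder value, not taken from the repo
unset FORGE_REMOTE      # simulate a caller that skipped resolve_forge_remote
setup_msg=$( set -u; formula_worktree_setup "/tmp/example-worktree" )
echo "${setup_msg}"
```

Callers that already invoke resolve_forge_remote are unaffected, since the guard only fires when FORGE_REMOTE is unset or empty.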

Acceptance

  • After fix, data/logs/supervisor/supervisor.log shows --- Supervisor run start --- followed by Running preflight.sh and eventually --- Supervisor run done --- on every iteration.
  • data/logs/supervisor.log (the stderr sink) does NOT contain FORGE_REMOTE: unbound variable entries going forward.
  • WP-agent CI-exhausted sweep demonstrably runs at least once end-to-end (log line WP agent restart and issue recovery complete or WP Agent Health: healthy path).
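
The first two acceptance checks could be scripted roughly as follows. This is a sketch only: check_supervisor_logs is a hypothetical helper, the 3-line tail window is an arbitrary choice, and a run that is still in progress would be reported as a failure.

```shell
#!/usr/bin/env bash
# Hypothetical helper: $1 = agent-written log, $2 = stderr sink.
check_supervisor_logs() {
  local agent_log="$1" stderr_log="$2"
  local last_marker

  # Find the most recent start/done marker in the agent-written log.
  last_marker=$(grep -E -- '--- Supervisor run (start|done) ---' "$agent_log" | tail -n 1) || true
  case "$last_marker" in
    *'--- Supervisor run done ---'*) ;;
    *) echo "FAIL: latest run did not reach the done marker"; return 1 ;;
  esac

  # The stderr sink should no longer end with the unbound-variable error.
  if tail -n 3 "$stderr_log" | grep -q 'FORGE_REMOTE: unbound variable'; then
    echo "FAIL: stderr sink still reports FORGE_REMOTE unbound"
    return 1
  fi

  echo "OK"
}
```

Inside the container it would be called with the two log paths named above, the agent-written supervisor/supervisor.log first and the stderr sink supervisor.log second.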

Related

  • Observed stale lockfiles at /tmp/{architect,gardener,planner,predictor}-run.lock from the 2026-04-21T11:58Z 6h tick: not released, PIDs dead. May be a related-but-separate bug (agents crashing without trap EXIT releasing the lock). Worth a follow-up issue if it reproduces after the supervisor fix lands.

dev-bot added the bug-report and reproduced labels 2026-04-21 13:09:59 +00:00
Reference: disinto-admin/disinto#1120