bug: entrypoint clones project at /home/agent/repos/${COMPOSE_PROJECT_NAME} but TOML parse later rewrites PROJECT_REPO_ROOT — dev-agent cd fails silently #861
Reference: disinto-admin/disinto#861
Problem
For a freshly created agent container (e.g. a newly hired llama dev via `hire-an-agent`), every `dev-agent.sh` invocation dies silently after the `setting up worktree` log line, with no visible error. The only trace is the `cd` error in `/home/agent/data/logs/dev/dev-agent-<project>.log`, where dev-poll redirects the subshell's stderr.

Under `set -e`, the failed `cd "$REPO_ROOT"` (`dev-agent.sh:257`) exits the script. The `cleanup` EXIT trap fires and releases the issue, so no PR is ever created. The next dev-poll iteration, 60s later, finds the same issue still ready and launches dev-agent again, with the same failure. The result is an infinite claim/release flip-flop visible in Forgejo (timeline spam, no progress).

Root cause: order of operations in `docker/agents/entrypoint.sh`

Two things happen in the wrong order:
1. `local repo_dir="/home/agent/repos/${PROJECT_NAME}"`: uses the `PROJECT_NAME` provided by compose env (which is the fallback `project`, not the TOML's `name`). The clone lands at `/home/agent/repos/project`.
2. `export PROJECT_NAME="$_pname"` / `export PROJECT_REPO_ROOT="/home/agent/repos/${_pname}"`: now reassigns both based on the TOML `name` field (e.g. `disinto`). `PROJECT_REPO_ROOT` becomes `/home/agent/repos/disinto`.

Result: `PROJECT_REPO_ROOT` points at a directory that was never created, so the `cd "$REPO_ROOT"` at `dev-agent.sh:257` fails.

Why dev-qwen (existing agent) doesn't hit this
dev-qwen's `project-repos-llama` volume happens to contain `/home/agent/repos/disinto` from a historical run (before today's refactor, the legacy compose stanza used `PROJECT_NAME=disinto` directly). Fresh agents get a fresh volume containing only `/home/agent/repos/project`, with no directory at `disinto`.

Volume contents today:
- dev-qwen (`project-repos-llama`): `_factory, disinto, disinto-ops, project` ← has the historical `disinto/` dir
- dev-qwen2 (`project-repos-dev-qwen2`, fresh): `_factory, disinto-ops, project` ← missing `disinto/`

Why the silent failure is hard to diagnose
- `dev-agent.sh`'s internal `log()` writes to `dev-agent.log` (a separate file). The `cd` error goes to stderr, which dev-poll redirects to `dev-agent-<project>.log`. Two log files, two streams, no cross-reference.
- `dev-agent.log` shows `setting up worktree` as the last line, implying the worktree setup block was entered. It says nothing about the `cd` failure.
- The `cleanup` EXIT trap fires and logs `cleanup: releasing issue (no PR created)` to `dev-agent.log`, masking the real cause.
- `dev-agent.sh` alone doesn't help: the `cd` target is correct per the code; the bug is upstream, in how `$REPO_ROOT` was populated.

Fix options
Option A (minimal, targeted): fix entrypoint ordering
Move the TOML parse and the `PROJECT_NAME` / `PROJECT_REPO_ROOT` re-export to BEFORE the clone. The clone already uses `${PROJECT_NAME}`, so it will then land at `/home/agent/repos/<toml_name>`, and everything downstream (`REPO_ROOT`, `cd`, git ops) resolves consistently.

Option B (broader): drop `PROJECT_NAME` from compose env, let TOML be the single source

`lib/generators.sh` currently emits `PROJECT_NAME: project` as a default in every agent service stanza. Remove that line and let the entrypoint derive `PROJECT_NAME` exclusively from the TOML. This eliminates the shadow value that causes the mismatch.

Option C (defensive, orthogonal): surface the silent failure
Regardless of the ordering fix, `dev-agent.sh:257` should not die silently. Add explicit error handling around the `cd` that reports the failure through `log()`. Because `log()` writes to `dev-agent.log`, a human looking at the dev-agent log sees the real cause immediately.

Recommended rollout
`PROJECT_NAME` as an override for edge cases.

Acceptance criteria
- `disinto hire-an-agent <name> dev --local-model <url>` produces its first PR without manual intervention (no `/home/agent/repos/<pname>` symlink workarounds)
- A `cd` failure in `dev-agent.sh` produces a clear error in `dev-agent.log` (not just in the dev-poll-level log)
- The existing agent (dev-qwen, with the historical `disinto/` dir) is unaffected
- `projects/disinto.toml` `name`-field-driven values are consistently used by all agent containers

Affected files
- `docker/agents/entrypoint.sh`: move the TOML parse and the `PROJECT_NAME` + `PROJECT_REPO_ROOT` re-export above the project-clone step (line 281)
- `dev/dev-agent.sh`: guard `cd "$REPO_ROOT"` with a clear error message (line 257)
- `lib/generators.sh`: drop the `PROJECT_NAME: project` default from the per-agent stanza if going with option B

Reproduction
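The failure mode can be sketched in isolation. The script below is not the original reproduction steps; it is a minimal stand-in, assuming only what the issue describes: a clone directory named after the compose fallback `project`, a later rewrite of `PROJECT_REPO_ROOT` to the TOML name `disinto`, and a `set -e` script with an EXIT trap and an unguarded `cd`. The sandbox paths, `agent_log`, and `poll_log` are hypothetical names that stand in for the real volume and the two log files.

```shell
#!/usr/bin/env bash
# Standalone sketch of the failure mode. Everything lives under a temp dir;
# nothing here touches the real entrypoint.sh or dev-agent.sh.

sandbox="$(mktemp -d)"
repos="$sandbox/repos"
agent_log="$sandbox/dev-agent.log"          # stands in for dev-agent.log
poll_log="$sandbox/dev-agent-project.log"   # stands in for the dev-poll stderr redirect

# Step 1: compose env supplies the fallback name, and the "clone" lands there.
PROJECT_NAME="project"
mkdir -p "$repos/$PROJECT_NAME"

# Step 2: the TOML parse then rewrites both variables (name = "disinto").
_pname="disinto"
PROJECT_NAME="$_pname"
PROJECT_REPO_ROOT="$repos/$_pname"

# Step 3: a dev-agent.sh-like subshell with set -e, an EXIT trap, and an
# unguarded cd. Its stderr is redirected the way dev-poll is described to do.
(
  set -e
  REPO_ROOT="$PROJECT_REPO_ROOT"
  trap 'echo "cleanup: releasing issue (no PR created)" >>"$agent_log"' EXIT
  echo "setting up worktree" >>"$agent_log"
  cd "$REPO_ROOT"
  echo "worktree ready" >>"$agent_log"      # not reached when cd fails
) 2>>"$poll_log"
status=$?

echo "exit status: $status"
echo "--- dev-agent.log ---"
cat "$agent_log"
echo "--- dev-poll stderr ---"
cat "$poll_log"
```

In this sketch the agent log ends with the `setting up worktree` and `cleanup` lines, while the `cd` error appears only in the stderr file. Moving step 2 above step 1, as Option A proposes, creates the directory under the TOML name and lets the `cd` succeed.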
Context
Diagnosed today during an attempt to scale dev throughput by adding a second llama-backed agent (dev-qwen2). It took 30+ minutes of flip-flop before the stderr redirect file was checked and revealed the `cd` error. All code upstream of `cd "$REPO_ROOT"` executed fine (claim succeeded, worktree setup status logged); the failure would have been trivially diagnosable if surfaced, but the silent-exit behavior under `set -e` made it invisible.

Related: #845, #847, #855, #856, all in the same class of "new-agent hire path has silent failure modes that manual testing doesn't hit".
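To illustrate what "surfaced" could mean for the guard proposed in Option C, here is a minimal sketch. It is not the code from `dev-agent.sh`: the `log()` body, the `LOGFILE` variable, and the `enter_repo_root` helper are all hypothetical stand-ins, and the real `log()` helper may format and route messages differently.

```shell
#!/usr/bin/env bash
# Hypothetical guard around the cd. log() below is a stand-in for
# dev-agent.sh's own helper, and LOGFILE is an assumed name for the
# path behind dev-agent.log.

LOGFILE="${LOGFILE:-$(mktemp)}"

log() {
  # Timestamped line appended to the agent log.
  printf '[%s] %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "$*" >>"$LOGFILE"
}

# enter_repo_root DIR: cd into DIR, or log the reason and return non-zero.
enter_repo_root() {
  local dir="$1"
  if [ ! -d "$dir" ]; then
    log "ERROR: REPO_ROOT '$dir' does not exist; check PROJECT_NAME / PROJECT_REPO_ROOT set by the entrypoint"
    return 1
  fi
  cd "$dir" || { log "ERROR: cd '$dir' failed"; return 1; }
}

# Possible call site in a set -e script, so the reason is logged before exit:
#   enter_repo_root "$REPO_ROOT" || exit 1
```

Writing the message through `log()` rather than stderr is the point of the design: the reason for the exit lands in the same file as the `setting up worktree` and `cleanup` lines, instead of only in the dev-poll redirect.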
`cd` fails silently (#861) #864