bug: entrypoint clones project at /home/agent/repos/${COMPOSE_PROJECT_NAME} but TOML parse later rewrites PROJECT_REPO_ROOT — dev-agent cd fails silently #861

Closed
opened 2026-04-16 11:22:16 +00:00 by dev-bot · 0 comments

Problem

For a freshly-created agent container (e.g. a newly-hired llama dev via hire-an-agent), every dev-agent.sh invocation dies silently after the setting up worktree log line with no visible error. The issue only surfaces in /home/agent/data/logs/dev/dev-agent-<project>.log (where dev-poll redirects the subshell stderr):

/home/agent/repos/_factory/dev/dev-agent.sh: line 257: cd: /home/agent/repos/disinto: No such file or directory

Under set -e, the failed cd "$REPO_ROOT" (dev-agent.sh:257) exits the script. The cleanup EXIT trap fires and releases the issue, so no PR is ever created. The next dev-poll iteration, 60s later, finds the same issue still ready and launches dev-agent again, with the same failure. The result is an infinite claim/release flip-flop visible in Forgejo (timeline spam, no progress).

Root cause — order of operations in docker/agents/entrypoint.sh

Two things happen in the wrong order:

  1. Line 281 (project clone): local repo_dir="/home/agent/repos/${PROJECT_NAME}" — uses the PROJECT_NAME provided by compose env (which is the fallback project, not the TOML's name). Clone lands at /home/agent/repos/project.
  2. Lines 382-383 (TOML-derived override): export PROJECT_NAME="$_pname" / export PROJECT_REPO_ROOT="/home/agent/repos/${_pname}". This reassigns both from the TOML name field (e.g. disinto), so PROJECT_REPO_ROOT becomes /home/agent/repos/disinto.

Result: PROJECT_REPO_ROOT points at a directory that was never created. dev-agent.sh:257's cd "$REPO_ROOT" fails.
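The ordering problem can be reproduced outside the container. The following is a minimal sketch, not the real entrypoint: it assumes a temp directory in place of /home/agent/repos, a mkdir in place of the git clone, and a hypothetical toml_name variable holding the already-parsed TOML name.

```bash
# Minimal sketch of the ordering bug; not the real entrypoint.sh.
set -euo pipefail

repos_root="$(mktemp -d)"        # stand-in for /home/agent/repos
PROJECT_NAME="project"           # compose-level fallback value
toml_name="disinto"              # hypothetical: TOML name field, already parsed

# Step 1 (entrypoint.sh:281 in this issue): clone target uses the compose value.
repo_dir="${repos_root}/${PROJECT_NAME}"
mkdir -p "$repo_dir"             # stands in for the actual git clone

# Step 2 (entrypoint.sh:382-383): the TOML-derived override happens afterwards.
PROJECT_NAME="$toml_name"
PROJECT_REPO_ROOT="${repos_root}/${PROJECT_NAME}"

# The exported root now names a directory that step 1 never created.
if [ -d "$PROJECT_REPO_ROOT" ]; then
  mismatch="no"
else
  mismatch="yes"
fi
echo "cloned_to=$(basename "$repo_dir") repo_root=$(basename "$PROJECT_REPO_ROOT") mismatch=${mismatch}"
rm -rf "$repos_root"
```

The clone lands under project while the exported root points at disinto, which is the state a fresh volume ends up in.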

Why dev-qwen (existing agent) doesn't hit this

dev-qwen's project-repos-llama volume happens to contain /home/agent/repos/disinto from a historical run (before today's refactor, the legacy compose stanza used PROJECT_NAME=disinto directly). Fresh agents get a fresh volume with only /home/agent/repos/project, and there's no dir at disinto.

Volume contents today:

  • dev-qwen (project-repos-llama): _factory, disinto, disinto-ops, project ← has the historical disinto/ dir
  • dev-qwen2 (project-repos-dev-qwen2, fresh): _factory, disinto-ops, project ← no disinto/ dir

Why the silent failure is hard to diagnose

  1. dev-agent.sh's internal log() writes to dev-agent.log (a separate file). The cd error goes to stderr, which dev-poll redirects to dev-agent-<project>.log. Two log files, two streams, no cross-reference.
  2. dev-agent.log shows setting up worktree as the last line, implying the worktree setup block was entered. Nothing about the cd failure.
  3. The cleanup EXIT trap fires and logs cleanup: releasing issue (no PR created) to dev-agent.log, masking the real cause.
  4. Investigating dev-agent.sh alone doesn't help — the cd target is correct per the code; the bug is upstream in how $REPO_ROOT was populated.
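The split between the two log files can be shown with a miniature stand-in for dev-agent.sh. This is a sketch under assumptions: mini-agent.sh, AGENT_LOG and the temp file names are hypothetical, and the real log() and cleanup functions may do more than append one line.

```bash
# Sketch: why the cd error and the last log line end up in different files.
set -u
work="$(mktemp -d)"
agent_log="${work}/dev-agent.log"            # where log() writes
poll_log="${work}/dev-agent-project.log"     # where the caller redirects stderr

# Hypothetical miniature of dev-agent.sh: log(), an EXIT trap, set -e, a failing cd.
cat > "${work}/mini-agent.sh" <<'EOF'
set -e
log() { printf '%s\n' "$*" >> "$AGENT_LOG"; }
cleanup() { log "cleanup: releasing issue (no PR created)"; }
trap cleanup EXIT
log "setting up worktree"
cd "$REPO_ROOT"
log "unreachable when REPO_ROOT is missing"
EOF

# The caller (dev-poll in this issue) sends the child's stderr to its own file.
rc=0
AGENT_LOG="$agent_log" REPO_ROOT="${work}/does-not-exist" \
  bash "${work}/mini-agent.sh" 2>> "$poll_log" || rc=$?

agent_lines="$(cat "$agent_log")"
poll_lines="$(cat "$poll_log")"
printf 'exit=%s\n--- dev-agent.log\n%s\n--- stderr file\n%s\n' \
  "$rc" "$agent_lines" "$poll_lines"
rm -rf "$work"
```

In this sketch dev-agent.log contains only the worktree line and the cleanup line, while the cd error appears solely in the stderr file.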

Fix options

Option A (minimal, targeted): fix entrypoint ordering

Move the TOML parse + PROJECT_NAME / PROJECT_REPO_ROOT re-export to BEFORE the clone. The clone already uses ${PROJECT_NAME} so it will then land at /home/agent/repos/<toml_name> and everything downstream (REPO_ROOT, cd, git ops) resolves consistently.
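A rough sketch of the reordered flow, under stated assumptions: toml_name is a hypothetical helper that reads a top-level name = "..." line with sed (the real entrypoint may parse the TOML differently), and temp paths stand in for /home/agent/repos and projects/disinto.toml.

```bash
# Sketch of Option A: resolve the name from the TOML first, then derive all
# paths from that single value. Helper and paths are illustrative only.
set -euo pipefail

work="$(mktemp -d)"
repos_root="${work}/repos"                    # stand-in for /home/agent/repos
toml="${work}/disinto.toml"                   # stand-in for projects/disinto.toml
printf 'name = "disinto"\n' > "$toml"

# Hypothetical helper: print the value of a top-level `name = "..."` line.
toml_name() {
  sed -n 's/^name[[:space:]]*=[[:space:]]*"\([^"]*\)".*$/\1/p' "$1" | head -n 1
}

PROJECT_NAME="project"                        # compose-level fallback
_pname="$(toml_name "$toml")"
if [ -n "$_pname" ]; then
  PROJECT_NAME="$_pname"                      # override BEFORE any path is built
fi
PROJECT_REPO_ROOT="${repos_root}/${PROJECT_NAME}"

repo_dir="${repos_root}/${PROJECT_NAME}"      # clone target, computed after the override
mkdir -p "$repo_dir"                          # stands in for the git clone

cd "$PROJECT_REPO_ROOT"                       # succeeds: same directory as repo_dir
resolved="$(basename "$PWD")"
echo "resolved=${resolved}"
cd /
rm -rf "$work"
```

Because the clone target and PROJECT_REPO_ROOT are both built after the override, they cannot diverge, and the compose value still applies when the TOML has no name.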

Option B (broader): drop PROJECT_NAME from compose env, let TOML be the single source

lib/generators.sh currently emits PROJECT_NAME: project as a default in every agent service stanza. Remove that line; let entrypoint derive PROJECT_NAME exclusively from the TOML. Eliminates the shadow value that causes the mismatch.

Option C (defensive, orthogonal): surface the silent failure

Regardless of the ordering fix, dev-agent.sh:257 should not die silently. Add explicit error handling:

if ! cd "$REPO_ROOT"; then
  log "ERROR: REPO_ROOT=${REPO_ROOT} does not exist — cannot cd. Check PROJECT_REPO_ROOT vs compose PROJECT_NAME vs TOML name mismatch."
  exit 1
fi

The log() call writes to dev-agent.log, where a human looking at the dev-agent log sees the real cause immediately.
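The guard can be exercised in isolation to confirm that the error reaches the agent log even with set -e active. This is a sketch: guarded.sh and AGENT_LOG are hypothetical names, and the log() shown is a stand-in for the one in dev-agent.sh.

```bash
# Sketch: run the guarded cd from Option C in a child shell with set -e.
work="$(mktemp -d)"
agent_log="${work}/dev-agent.log"

cat > "${work}/guarded.sh" <<'EOF'
set -e
# Hypothetical stand-in for dev-agent.sh's log(); the real one may add timestamps.
log() { printf '%s\n' "$*" >> "$AGENT_LOG"; }
# A command tested by `if` is exempt from `set -e`, so the error branch runs.
if ! cd "$REPO_ROOT"; then
  log "ERROR: REPO_ROOT=${REPO_ROOT} does not exist, cannot cd"
  exit 1
fi
log "cd ok"
EOF

rc=0
# stderr is discarded here; in the real setup dev-poll captures it.
AGENT_LOG="$agent_log" REPO_ROOT="${work}/does-not-exist" \
  bash "${work}/guarded.sh" 2>/dev/null || rc=$?

logged="$(cat "$agent_log")"
printf 'exit=%s\n%s\n' "$rc" "$logged"
rm -rf "$work"
```

The script still exits non-zero, so the calling behavior is unchanged; the only difference is that the cause is now written to the same file as the other dev-agent messages.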

Recommended rollout

  1. Land option C first (defensive log + exit 1). It helps diagnose this class of bug in the future without changing behavior.
  2. Land option A (ordering fix) — minimal behavior change, keeps compose-level PROJECT_NAME as an override for edge cases.
  3. Consider option B in a follow-up cleanup once option A settles.

Acceptance criteria

  • A freshly hired agent via disinto hire-an-agent <name> dev --local-model <url> produces its first PR without manual intervention (no /home/agent/repos/<pname> symlink workarounds)
  • cd failure in dev-agent.sh produces a clear error in dev-agent.log (not just in the dev-poll-level log)
  • Existing dev-qwen (historical volume with disinto/ dir) is unaffected
  • projects/disinto.toml name-field-driven values are consistently used by all agent containers

Affected files

  • docker/agents/entrypoint.sh — move TOML parse / PROJECT_NAME + PROJECT_REPO_ROOT re-export above the project-clone step (line 281)
  • dev/dev-agent.sh — guard cd "$REPO_ROOT" with a clear error message (line 257)
  • Optionally lib/generators.sh — drop PROJECT_NAME: project default from the per-agent stanza if going with option B

Reproduction

# On disinto-dev-box with a working disinto factory
disinto hire-an-agent dev-qwen2 dev --local-model http://10.10.10.1:8081 --model unsloth/Qwen3.5-35B-A3B
# Add dev-qwen2 as write collaborator (until #856 lands):
curl -X PUT -H "Authorization: token $FORGE_ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  http://forgejo:3000/api/v1/repos/disinto-admin/disinto/collaborators/dev-qwen2 \
  -d '{"permission":"write"}'
# Bring up:
docker compose --profile agents-dev-qwen2 up -d agents-dev-qwen2
# Watch dev-poll log:
docker exec disinto-agents-dev-qwen2 tail -f /home/agent/data/logs/dev/dev-agent-disinto.log
# Observe the flip-flop + the silent cd failure only visible in the -disinto.log file

Context

Diagnosed today during an attempt to scale dev throughput by adding a second llama-backed agent (dev-qwen2). The flip-flop ran for 30+ minutes before the stderr redirect file was checked and revealed the cd error. All code upstream of cd "$REPO_ROOT" executed fine (claim succeeded, worktree setup status logged); the failure would have been trivially diagnosable had it been surfaced, but the silent-exit behavior under set -e made it invisible.

Related: #845, #847, #855, #856 — all in the same class of "new-agent hire path has silent failure modes that manual testing doesn't hit".

dev-bot added the backlog and priority labels 2026-04-16 11:22:16 +00:00
dev-qwen self-assigned this 2026-04-16 11:40:29 +00:00
dev-qwen added the in-progress label and removed the backlog label 2026-04-16 11:40:29 +00:00
dev-qwen removed their assignment 2026-04-16 11:51:01 +00:00
dev-qwen removed the in-progress label 2026-04-16 11:51:01 +00:00
Reference: disinto-admin/disinto#861