fix: standardize logging across all agents — capture errors, log exit codes, consistent format #367

New issue

Open

opened 2026-04-07 17:28:09 +00:00 by dev-bot · 0 comments

dev-bot commented

2026-04-07 17:28:09 +00:00

Collaborator

Problem

Agent logging is inconsistent and swallows errors. When things fail, the logs say "failed" with no context — no exit code, no stderr, no output. This made a simple PATH issue (claude not found) take hours to diagnose.

Audit findings

1. Subprocess output discarded on failure

These call subprocesses but discard their output when they fail:

review/review-poll.sh lines 129, 169: review-pr.sh output goes nowhere on failure, only logs "review failed"
lib/agent-sdk.sh line 51: || true swallows claude exit code — timeout (124) vs crash (1) vs success (0) are indistinguishable
lib/agent-sdk.sh line 82: same for nudge invocation
lib/pr-lifecycle.sh line 364: pr_close failure fully silenced
lib/pr-lifecycle.sh lines 403-405: git rebase/push output printed to stdout then discarded

2. Inconsistent log format

Two formats in use:

Format A (gardener, planner, architect, predictor, supervisor): [2026-04-03T14:00:00Z] message via echo
Format B (review-poll, dev-poll, dev-agent): [2026-04-03 14:00:00 UTC] message via printf
dispatcher.sh: stdout only, no persistent file

3. Inconsistent log destinations

Some agents log to DISINTO_LOG_DIR, some to SCRIPT_DIR (which is read-only in containers). The gardener LOG_FILE bug (#210) was fixed but the pattern persists in other agents.

Fix

A. Capture and log subprocess errors

For every subprocess call that can fail, capture output and log last lines on error:

output=$(some_command 2>&1) || {
  log "command failed (exit $?): $(echo "$output" | tail -3)"
}

Specific changes:

review/review-poll.sh lines 129, 169: capture review-pr.sh output, log tail on failure
lib/agent-sdk.sh lines 51, 82: capture claude exit code, log it (distinguish timeout/crash/success)
lib/pr-lifecycle.sh line 364: log pr_close failure with HTTP code
lib/pr-lifecycle.sh lines 403-405: log git rebase/push failure output
gardener-run.sh manifest API calls: log HTTP status on failure (replace -sf with -s -o /dev/null -w '%{http_code}')

B. Standardize log format

Use a single log() function from lib/env.sh. All agents should use the same format and not redefine log(). Recommended format:

[2026-04-03T14:00:00Z] agent: message

Where agent is the agent name (dev, review, gardener, etc.).

C. Standardize log destinations

All agents should log to ${DISINTO_LOG_DIR}/<agent>/<agent>.log. Never to $SCRIPT_DIR (read-only in containers).

Affected files

lib/env.sh (define canonical log function)
lib/agent-sdk.sh (log claude exit codes)
lib/pr-lifecycle.sh (log pr_close and rebase failures)
review/review-poll.sh (capture review-pr.sh output)
gardener/gardener-run.sh (log manifest API failures with HTTP codes)
supervisor/supervisor-run.sh (log preflight exit code)
All *-run.sh scripts (use canonical log, use DISINTO_LOG_DIR)

Acceptance criteria

Every subprocess failure includes exit code and last 3 lines of output in the log
All agents use the same log format
All agents log to DISINTO_LOG_DIR
No log() redefinitions in agent scripts — all use the canonical one from lib/env.sh
dispatcher.sh writes to a persistent log file

## Problem Agent logging is inconsistent and swallows errors. When things fail, the logs say "failed" with no context — no exit code, no stderr, no output. This made a simple PATH issue (claude not found) take hours to diagnose. ## Audit findings ### 1. Subprocess output discarded on failure These call subprocesses but discard their output when they fail: - `review/review-poll.sh` lines 129, 169: review-pr.sh output goes nowhere on failure, only logs "review failed" - `lib/agent-sdk.sh` line 51: `|| true` swallows claude exit code — timeout (124) vs crash (1) vs success (0) are indistinguishable - `lib/agent-sdk.sh` line 82: same for nudge invocation - `lib/pr-lifecycle.sh` line 364: pr_close failure fully silenced - `lib/pr-lifecycle.sh` lines 403-405: git rebase/push output printed to stdout then discarded ### 2. Inconsistent log format Two formats in use: - Format A (gardener, planner, architect, predictor, supervisor): `[2026-04-03T14:00:00Z] message` via echo - Format B (review-poll, dev-poll, dev-agent): `[2026-04-03 14:00:00 UTC] message` via printf - dispatcher.sh: stdout only, no persistent file ### 3. Inconsistent log destinations Some agents log to DISINTO_LOG_DIR, some to SCRIPT_DIR (which is read-only in containers). The gardener LOG_FILE bug (#210) was fixed but the pattern persists in other agents. ## Fix ### A. Capture and log subprocess errors For every subprocess call that can fail, capture output and log last lines on error: output=$(some_command 2>&1) || { log "command failed (exit $?): $(echo "$output" | tail -3)" } Specific changes: - `review/review-poll.sh` lines 129, 169: capture review-pr.sh output, log tail on failure - `lib/agent-sdk.sh` lines 51, 82: capture claude exit code, log it (distinguish timeout/crash/success) - `lib/pr-lifecycle.sh` line 364: log pr_close failure with HTTP code - `lib/pr-lifecycle.sh` lines 403-405: log git rebase/push failure output - `gardener-run.sh` manifest API calls: log HTTP status on failure (replace -sf with -s -o /dev/null -w '%{http_code}') ### B. Standardize log format Use a single log() function from lib/env.sh. All agents should use the same format and not redefine log(). Recommended format: [2026-04-03T14:00:00Z] agent: message Where agent is the agent name (dev, review, gardener, etc.). ### C. Standardize log destinations All agents should log to `${DISINTO_LOG_DIR}/<agent>/<agent>.log`. Never to `$SCRIPT_DIR` (read-only in containers). ## Affected files - lib/env.sh (define canonical log function) - lib/agent-sdk.sh (log claude exit codes) - lib/pr-lifecycle.sh (log pr_close and rebase failures) - review/review-poll.sh (capture review-pr.sh output) - gardener/gardener-run.sh (log manifest API failures with HTTP codes) - supervisor/supervisor-run.sh (log preflight exit code) - All *-run.sh scripts (use canonical log, use DISINTO_LOG_DIR) ## Acceptance criteria - [ ] Every subprocess failure includes exit code and last 3 lines of output in the log - [ ] All agents use the same log format - [ ] All agents log to DISINTO_LOG_DIR - [ ] No log() redefinitions in agent scripts — all use the canonical one from lib/env.sh - [ ] dispatcher.sh writes to a persistent log file

dev-bot added the

backlog

label 2026-04-07 17:28:09 +00:00

dev-bot changed title from ~~fix: review-poll discards review-pr.sh error output — failures have no context~~ to fix: standardize logging across all agents — capture errors, log exit codes, consistent format

2026-04-07 17:31:44 +00:00