fix: standardize logging across all agents — capture errors, log exit codes, consistent format #367

Open
opened 2026-04-07 17:28:09 +00:00 by dev-bot · 0 comments
Collaborator

Problem

Agent logging is inconsistent and swallows errors. When things fail, the logs say "failed" with no context — no exit code, no stderr, no output. This made a simple PATH issue (claude not found) take hours to diagnose.

Audit findings

1. Subprocess output discarded on failure

These call subprocesses but discard their output when they fail:

  • review/review-poll.sh lines 129, 169: review-pr.sh output goes nowhere on failure, only logs "review failed"
  • lib/agent-sdk.sh line 51: || true swallows claude exit code — timeout (124) vs crash (1) vs success (0) are indistinguishable
  • lib/agent-sdk.sh line 82: same for nudge invocation
  • lib/pr-lifecycle.sh line 364: pr_close failure fully silenced
  • lib/pr-lifecycle.sh lines 403-405: git rebase/push output printed to stdout then discarded

2. Inconsistent log format

Two formats in use:

  • Format A (gardener, planner, architect, predictor, supervisor): [2026-04-03T14:00:00Z] message via echo
  • Format B (review-poll, dev-poll, dev-agent): [2026-04-03 14:00:00 UTC] message via printf
  • dispatcher.sh: stdout only, no persistent file

3. Inconsistent log destinations

Some agents log to DISINTO_LOG_DIR, some to SCRIPT_DIR (which is read-only in containers). The gardener LOG_FILE bug (#210) was fixed but the pattern persists in other agents.

Fix

A. Capture and log subprocess errors

For every subprocess call that can fail, capture output and log last lines on error:

output=$(some_command 2>&1) || {
  log "command failed (exit $?): $(echo "$output" | tail -3)"
}

Specific changes:

  • review/review-poll.sh lines 129, 169: capture review-pr.sh output, log tail on failure
  • lib/agent-sdk.sh lines 51, 82: capture claude exit code, log it (distinguish timeout/crash/success)
  • lib/pr-lifecycle.sh line 364: log pr_close failure with HTTP code
  • lib/pr-lifecycle.sh lines 403-405: log git rebase/push failure output
  • gardener-run.sh manifest API calls: log HTTP status on failure (replace -sf with -s -o /dev/null -w '%{http_code}')

B. Standardize log format

Use a single log() function from lib/env.sh. All agents should use the same format and not redefine log(). Recommended format:

[2026-04-03T14:00:00Z] agent: message

Where agent is the agent name (dev, review, gardener, etc.).

C. Standardize log destinations

All agents should log to ${DISINTO_LOG_DIR}/<agent>/<agent>.log. Never to $SCRIPT_DIR (read-only in containers).

Affected files

  • lib/env.sh (define canonical log function)
  • lib/agent-sdk.sh (log claude exit codes)
  • lib/pr-lifecycle.sh (log pr_close and rebase failures)
  • review/review-poll.sh (capture review-pr.sh output)
  • gardener/gardener-run.sh (log manifest API failures with HTTP codes)
  • supervisor/supervisor-run.sh (log preflight exit code)
  • All *-run.sh scripts (use canonical log, use DISINTO_LOG_DIR)

Acceptance criteria

  • Every subprocess failure includes exit code and last 3 lines of output in the log
  • All agents use the same log format
  • All agents log to DISINTO_LOG_DIR
  • No log() redefinitions in agent scripts — all use the canonical one from lib/env.sh
  • dispatcher.sh writes to a persistent log file
## Problem Agent logging is inconsistent and swallows errors. When things fail, the logs say "failed" with no context — no exit code, no stderr, no output. This made a simple PATH issue (claude not found) take hours to diagnose. ## Audit findings ### 1. Subprocess output discarded on failure These call subprocesses but discard their output when they fail: - `review/review-poll.sh` lines 129, 169: review-pr.sh output goes nowhere on failure, only logs "review failed" - `lib/agent-sdk.sh` line 51: `|| true` swallows claude exit code — timeout (124) vs crash (1) vs success (0) are indistinguishable - `lib/agent-sdk.sh` line 82: same for nudge invocation - `lib/pr-lifecycle.sh` line 364: pr_close failure fully silenced - `lib/pr-lifecycle.sh` lines 403-405: git rebase/push output printed to stdout then discarded ### 2. Inconsistent log format Two formats in use: - Format A (gardener, planner, architect, predictor, supervisor): `[2026-04-03T14:00:00Z] message` via echo - Format B (review-poll, dev-poll, dev-agent): `[2026-04-03 14:00:00 UTC] message` via printf - dispatcher.sh: stdout only, no persistent file ### 3. Inconsistent log destinations Some agents log to DISINTO_LOG_DIR, some to SCRIPT_DIR (which is read-only in containers). The gardener LOG_FILE bug (#210) was fixed but the pattern persists in other agents. ## Fix ### A. Capture and log subprocess errors For every subprocess call that can fail, capture output and log last lines on error: output=$(some_command 2>&1) || { log "command failed (exit $?): $(echo "$output" | tail -3)" } Specific changes: - `review/review-poll.sh` lines 129, 169: capture review-pr.sh output, log tail on failure - `lib/agent-sdk.sh` lines 51, 82: capture claude exit code, log it (distinguish timeout/crash/success) - `lib/pr-lifecycle.sh` line 364: log pr_close failure with HTTP code - `lib/pr-lifecycle.sh` lines 403-405: log git rebase/push failure output - `gardener-run.sh` manifest API calls: log HTTP status on failure (replace -sf with -s -o /dev/null -w '%{http_code}') ### B. Standardize log format Use a single log() function from lib/env.sh. All agents should use the same format and not redefine log(). Recommended format: [2026-04-03T14:00:00Z] agent: message Where agent is the agent name (dev, review, gardener, etc.). ### C. Standardize log destinations All agents should log to `${DISINTO_LOG_DIR}/<agent>/<agent>.log`. Never to `$SCRIPT_DIR` (read-only in containers). ## Affected files - lib/env.sh (define canonical log function) - lib/agent-sdk.sh (log claude exit codes) - lib/pr-lifecycle.sh (log pr_close and rebase failures) - review/review-poll.sh (capture review-pr.sh output) - gardener/gardener-run.sh (log manifest API failures with HTTP codes) - supervisor/supervisor-run.sh (log preflight exit code) - All *-run.sh scripts (use canonical log, use DISINTO_LOG_DIR) ## Acceptance criteria - [ ] Every subprocess failure includes exit code and last 3 lines of output in the log - [ ] All agents use the same log format - [ ] All agents log to DISINTO_LOG_DIR - [ ] No log() redefinitions in agent scripts — all use the canonical one from lib/env.sh - [ ] dispatcher.sh writes to a persistent log file
dev-bot added the
backlog
label 2026-04-07 17:28:09 +00:00
dev-bot changed title from fix: review-poll discards review-pr.sh error output — failures have no context to fix: standardize logging across all agents — capture errors, log exit codes, consistent format 2026-04-07 17:31:44 +00:00
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: disinto-admin/disinto#367
No description provided.