fix: standardize logging across all agents — capture errors, log exit codes, consistent format (#367)
Some checks failed
ci/woodpecker/push/ci Pipeline failed
ci/woodpecker/pr/ci Pipeline failed
ci/woodpecker/pr/smoke-init Pipeline was successful

Agent 2026-04-07 18:51:34 +00:00
parent f686d47a98
commit d216e6294f
11 changed files with 190 additions and 136 deletions

View file

@@ -36,7 +36,7 @@ source "$FACTORY_ROOT/lib/guard.sh"
# shellcheck source=../lib/agent-sdk.sh
source "$FACTORY_ROOT/lib/agent-sdk.sh"
LOG_FILE="$SCRIPT_DIR/architect.log"
LOG_FILE="${DISINTO_LOG_DIR}/architect/architect.log"
# shellcheck disable=SC2034 # consumed by agent-sdk.sh
LOGFILE="$LOG_FILE"
# shellcheck disable=SC2034 # consumed by agent-sdk.sh
@@ -44,7 +44,8 @@ SID_FILE="/tmp/architect-session-${PROJECT_NAME}.sid"
SCRATCH_FILE="/tmp/architect-${PROJECT_NAME}-scratch.md"
WORKTREE="/tmp/${PROJECT_NAME}-architect-run"
log() { echo "[$(date -u +%Y-%m-%dT%H:%M:%S)Z] $*" >> "$LOG_FILE"; }
# Override LOG_AGENT for consistent agent identification
LOG_AGENT="architect"
# ── Guards ────────────────────────────────────────────────────────────────
check_active architect

View file

@@ -1,54 +1,57 @@
# Working with the factory — lessons learned
# Lessons Learned
## Writing issues for the dev agent
## Issue Management
**Put everything in the issue body, not comments.** The dev agent reads the issue body when it starts work. It does not reliably read comments. If an issue fails and you need to add guidance for a retry, update the issue body.
**Put everything in the issue body.** Agents read the issue body; comments are unreliable. Add guidance by updating the body.
**One approach per issue, no choices.** The dev agent cannot make design decisions. If there are multiple ways to solve a problem, decide before filing. Issues with "Option A or Option B" will confuse the agent.
**One decision per issue.** Agents cannot choose between approaches. Decide before filing.
**Issues must fit the templates.** Every backlog issue needs: affected files (max 3), acceptance criteria (max 5 checkboxes), and a clear proposed solution. If you cannot fill these fields, the issue is too big — label it `vision` and break it down first.
**Explicit dependencies prevent ordering bugs.** Use `Depends-on: #N` so agents don't work on stale code.
**Explicit dependencies prevent ordering bugs.** Add `Depends-on: #N` in the issue body. dev-poll checks these before pickup. Without explicit deps, the agent may attempt work on a stale codebase.
**Fit issues to templates.** Max 3 files, 5 acceptance criteria, clear solution. If it doesn't fit, it's too big—label `vision` and decompose.
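A minimal sketch of composing a template-conforming issue body — the section names and field layout here are illustrative placeholders, not the factory's exact template:

```shell
# Compose an issue body with explicit deps and template fields (illustrative names).
cat > /tmp/issue-body.md <<'EOF'
Depends-on: #42

## Affected files (max 3)
- agents/dispatcher.sh
- lib/agent-sdk.sh

## Proposed solution
Route every log() call through the shared helper so exit codes are captured.

## Acceptance criteria (max 5)
- [ ] log lines carry the agent name
- [ ] non-zero exit codes are logged with the last lines of output
EOF
grep -c '^Depends-on:' /tmp/issue-body.md
```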
## Debugging CI failures
## Debugging CI Failures
**Check CI logs via Woodpecker SQLite when the API fails.** The Woodpecker v3 log API may return HTML instead of JSON. Reliable fallback:
```bash
sqlite3 /var/lib/docker/volumes/disinto_woodpecker-data/_data/woodpecker.sqlite \
"SELECT le.data FROM log_entries le \
JOIN steps s ON le.step_id = s.id \
JOIN workflows w ON s.pipeline_id = w.id \
JOIN pipelines p ON w.pipeline_id = p.id \
WHERE p.number = <N> AND s.name = '<step>' ORDER BY le.id"
```
**Agents can't see CI logs.** When an issue fails repeatedly, diagnose externally and put the exact error and fix in the issue body.
**When the agent fails repeatedly on CI, diagnose externally.** The dev agent cannot see CI log output (only pass/fail status). If the same step fails 3+ times, read the logs yourself and put the exact error and fix in the issue body.
**Blocked CI often means environment, not code.** Missing libraries, auth failures, or path assumptions usually point to infrastructure misconfiguration.
## Retrying failed issues
## Base Image and Environment
**Clean up stale branches before retrying.** Old branches cause recovery mode which inherits stale code. Close the PR, delete the branch on Forgejo, then relabel to backlog.
**Minimal images introduce hidden traps.** Alpine/musl libc breaks tools that work on glibc. Validate base image assumptions before committing to a path.
**After a dependency lands, stale branches miss the fix.** If issue B depends on A, and B's PR was created before A merged, B's branch is stale. Close the PR and delete the branch so the agent starts fresh from current main.
**Ask: "What system libraries does this tool actually need?"** Don't assume development environment constraints apply to production images.
## Environment gotchas
**Document base image tradeoffs.** Alpine saves space but costs compatibility. Ubuntu saves compatibility but costs size. Make decisions explicit.
**Alpine/BusyBox differs from Debian.** CI and edge containers use Alpine:
- `grep -P` (Perl regex) does not work — use `grep -E`
- `USER` variable is unset — set it explicitly: `USER=$(whoami); export USER`
- Network calls fail during `docker build` in LXD — download binaries on the host, COPY into images
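The first two gotchas have portable workarounds; a BusyBox-safe sketch (assumes only `grep -E` and `whoami` are available):

```shell
# BusyBox grep has no -P; extended regex (-E) covers most patterns.
printf 'exit code: 42\n' | grep -E 'exit code: [0-9]+'

# USER can be unset under BusyBox sh; derive and export it explicitly.
USER="${USER:-$(whoami)}"
export USER
echo "running as ${USER}"
```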
## Authentication
**The host repo drifts from Forgejo main.** If factory code is bind-mounted, the host checkout goes stale. Pull regularly or use versioned releases.
**Treat auth as first-class.** Authentication assumptions are the most fragile part of any integration.
## Vault operations
**Audit all external calls.** Map every operation touching remote systems and verify each has proper auth—even internal repos can require credentials.
**The human merging a vault PR must be a Forgejo site admin.** The dispatcher verifies `is_admin` on the merger. Promote your user via the Forgejo CLI or database if needed.
**Test in failure mode.** Don't just verify the happy path. Simulate restricted access environments.
**Result files cache failures.** If a vault action fails, the dispatcher writes `.result.json` and skips it. To retry: delete the result file inside the edge container.
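A sketch of the retry procedure — the directory and action id below are illustrative placeholders, not the edge container's actual layout:

```shell
# Force a retry of a cached-failed vault action by clearing its result file.
ACTIONS_DIR="/tmp/vault-actions-demo"          # placeholder path
action_id="rotate-db-creds"                    # placeholder action id
mkdir -p "$ACTIONS_DIR"
touch "${ACTIONS_DIR}/${action_id}.result.json"   # simulate a cached failure
rm -f "${ACTIONS_DIR}/${action_id}.result.json"   # dispatcher will re-dispatch on next poll
[ ! -e "${ACTIONS_DIR}/${action_id}.result.json" ] && echo "cleared; action eligible for retry"
```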
**Verify the entire chain.** Each layer in a call chain can strip credentials. Check every script, command, and library boundary.
## Breaking down large features
**Special cases often bypass standard flows.** "Edge" or "beta" builds frequently have auth gaps.
**Vision issues need structured decomposition.** When a feature touches multiple subsystems or has design forks, label it `vision`. Break it down by identifying what exists, what can be reused, where the design forks are, and resolve them before filing backlog issues.
## Concurrency and Locking
**Prefer gluecode over greenfield.** Check if Forgejo API, Woodpecker, Docker, or existing lib/ functions can do the job before building new components.
**Scope conditions to the actor, not the system.** In multi-agent systems, distinguish "in progress" from "in progress by me."
**Max 7 sub-issues per sprint.** If a breakdown produces more, split into two sprints.
**Design blocking points with "who does this affect?" comments.** Challenge any global-blocking condition: "What if five agents hit this simultaneously?"
**State falls into three categories:** global (affects all), actor-local, and shared resources. Only block on global state or own ownership.
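An actor-local guard can be sketched as follows (agent name and lock path are illustrative):

```shell
# Block on "in progress by me", not "in progress" globally:
# a per-agent pid lockfile only collides with this agent's own runs.
agent="gardener"
lockfile="/tmp/${agent}-run-demo.lock"
if [ -f "$lockfile" ] && kill -0 "$(cat "$lockfile")" 2>/dev/null; then
  echo "skip: ${agent} already running (pid $(cat "$lockfile"))"
else
  echo "$$" > "$lockfile"    # actor-local state; other agents are unaffected
  echo "lock acquired for ${agent}"
fi
```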
## Branch and PR Hygiene
**Clean up stale branches before retrying.** Old branches cause recovery mode with stale code. Close PR, delete branch, then relabel.
**After a dependency lands, stale branches miss the fix.** If issue B depends on A and B's PR predates A's merge, close and recreate the branch.
## Feature Decomposition
**Prefer gluecode over greenfield.** Check existing APIs, libraries, or utilities before building new components.
**Max 7 sub-issues per sprint.** If more, split into two sprints.

View file

@@ -47,9 +47,14 @@ VAULT_ENV="${SCRIPT_ROOT}/../vault/vault-env.sh"
# Comma-separated list of Forgejo usernames with admin role
ADMIN_USERS="${FORGE_ADMIN_USERS:-vault-bot,admin}"
# Log function
# Persistent log file for dispatcher
DISPATCHER_LOG_FILE="${DISINTO_LOG_DIR:-/tmp}/dispatcher/dispatcher.log"
mkdir -p "$(dirname "$DISPATCHER_LOG_FILE")"
# Log function with standardized format
log() {
printf '[%s] %s\n' "$(date -u '+%Y-%m-%d %H:%M:%S UTC')" "$*"
local agent="${LOG_AGENT:-dispatcher}"
printf '[%s] %s: %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "$agent" "$*" >> "$DISPATCHER_LOG_FILE"
}
# -----------------------------------------------------------------------------
@@ -185,7 +190,7 @@ verify_admin_approver() {
# Fetch reviews for this PR
local reviews_json
reviews_json=$(get_pr_reviews "$pr_number") || {
log "WARNING: Could not fetch reviews for PR #${pr_number} — skipping"
log "WARNING: Could not fetch reviews for PR #${pr_number} — skipping" >> "$DISPATCHER_LOG_FILE"
return 1
}
@@ -193,7 +198,7 @@ verify_admin_approver() {
local review_count
review_count=$(echo "$reviews_json" | jq 'length // 0')
if [ "$review_count" -eq 0 ]; then
log "WARNING: No reviews found for PR #${pr_number} — rejecting"
log "WARNING: No reviews found for PR #${pr_number} — rejecting" >> "$DISPATCHER_LOG_FILE"
return 1
fi
@@ -216,12 +221,12 @@ verify_admin_approver() {
# Check if reviewer is admin
if is_allowed_admin "$reviewer"; then
log "Verified: PR #${pr_number} approved by admin '${reviewer}'"
log "Verified: PR #${pr_number} approved by admin '${reviewer}'" >> "$DISPATCHER_LOG_FILE"
return 0
fi
done < <(echo "$reviews_json" | jq -c '.[]')
log "WARNING: No admin approval found for PR #${pr_number} — rejecting"
log "WARNING: No admin approval found for PR #${pr_number} — rejecting" >> "$DISPATCHER_LOG_FILE"
return 1
}
@@ -243,11 +248,11 @@ verify_admin_merged() {
# Get the PR that introduced this file
local pr_num
pr_num=$(get_pr_for_file "$toml_file") || {
log "WARNING: No PR found for action ${action_id} — skipping (possible direct push)"
log "WARNING: No PR found for action ${action_id} — skipping (possible direct push)" >> "$DISPATCHER_LOG_FILE"
return 1
}
log "Action ${action_id} arrived via PR #${pr_num}"
log "Action ${action_id} arrived via PR #${pr_num}" >> "$DISPATCHER_LOG_FILE"
# First, try admin approver check (for auto-merge workflow)
if verify_admin_approver "$pr_num" "$action_id"; then
@@ -257,7 +262,7 @@ verify_admin_merged() {
# Fallback: Check merger (backwards compatibility for manual merges)
local merger_json
merger_json=$(get_pr_merger "$pr_num") || {
log "WARNING: Could not fetch PR #${pr_num} details — skipping"
log "WARNING: Could not fetch PR #${pr_num} details — skipping" >> "$DISPATCHER_LOG_FILE"
return 1
}
@@ -267,22 +272,22 @@ verify_admin_merged() {
# Check if PR is merged
if [[ "$merged" != "true" ]]; then
log "WARNING: PR #${pr_num} is not merged — skipping"
log "WARNING: PR #${pr_num} is not merged — skipping" >> "$DISPATCHER_LOG_FILE"
return 1
fi
# Check if merger is admin
if [ -z "$merger_username" ]; then
log "WARNING: Could not determine PR #${pr_num} merger — skipping"
log "WARNING: Could not determine PR #${pr_num} merger — skipping" >> "$DISPATCHER_LOG_FILE"
return 1
fi
if ! is_allowed_admin "$merger_username"; then
log "WARNING: PR #${pr_num} merged by non-admin user '${merger_username}' — skipping"
log "WARNING: PR #${pr_num} merged by non-admin user '${merger_username}' — skipping" >> "$DISPATCHER_LOG_FILE"
return 1
fi
log "Verified: PR #${pr_num} merged by admin '${merger_username}' (fallback check)"
log "Verified: PR #${pr_num} merged by admin '${merger_username}' (fallback check)" >> "$DISPATCHER_LOG_FILE"
return 0
}
@@ -343,7 +348,7 @@ write_result() {
'{id: $id, exit_code: $exit_code, timestamp: $timestamp, logs: $logs}' \
> "$result_file"
log "Result written: ${result_file}"
log "Result written: ${result_file}" >> "$DISPATCHER_LOG_FILE"
}
# Launch runner for the given action
@@ -353,18 +358,18 @@ launch_runner() {
local action_id
action_id=$(basename "$toml_file" .toml)
log "Launching runner for action: ${action_id}"
log "Launching runner for action: ${action_id}" >> "$DISPATCHER_LOG_FILE"
# Validate TOML
if ! validate_action "$toml_file"; then
log "ERROR: Action validation failed for ${action_id}"
log "ERROR: Action validation failed for ${action_id}" >> "$DISPATCHER_LOG_FILE"
write_result "$action_id" 1 "Validation failed: see logs above"
return 1
fi
# Verify admin merge
if ! verify_admin_merged "$toml_file"; then
log "ERROR: Admin merge verification failed for ${action_id}"
log "ERROR: Admin merge verification failed for ${action_id}" >> "$DISPATCHER_LOG_FILE"
write_result "$action_id" 1 "Admin merge verification failed: see logs above"
return 1
fi
@@ -400,7 +405,7 @@ launch_runner() {
fi
done
else
log "Action ${action_id} has no secrets declared — runner will execute without extra env vars"
log "Action ${action_id} has no secrets declared — runner will execute without extra env vars" >> "$DISPATCHER_LOG_FILE"
fi
# Add formula and action id as arguments (safe from shell injection)
@@ -423,7 +428,7 @@ launch_runner() {
log_cmd+=("$arg")
fi
done
log "Running: ${log_cmd[*]}"
log "Running: ${log_cmd[*]}" >> "$DISPATCHER_LOG_FILE"
# Create temp file for logs
local log_file
@@ -443,9 +448,9 @@ launch_runner() {
write_result "$action_id" "$exit_code" "$logs"
if [ $exit_code -eq 0 ]; then
log "Runner completed successfully for action: ${action_id}"
log "Runner completed successfully for action: ${action_id}" >> "$DISPATCHER_LOG_FILE"
else
log "Runner failed for action: ${action_id} (exit code: ${exit_code})"
log "Runner failed for action: ${action_id} (exit code: ${exit_code})" >> "$DISPATCHER_LOG_FILE"
fi
return $exit_code
@@ -512,7 +517,7 @@ dispatch_reproduce() {
local issue_number="$1"
if is_reproduce_running "$issue_number"; then
log "Reproduce already running for issue #${issue_number}, skipping"
log "Reproduce already running for issue #${issue_number}, skipping" >> "$DISPATCHER_LOG_FILE"
return 0
fi
@@ -523,11 +528,11 @@ dispatch_reproduce() {
done
if [ -z "$project_toml" ]; then
log "WARNING: no project TOML found under ${FACTORY_ROOT}/projects/ — skipping reproduce for #${issue_number}"
log "WARNING: no project TOML found under ${FACTORY_ROOT}/projects/ — skipping reproduce for #${issue_number}" >> "$DISPATCHER_LOG_FILE"
return 0
fi
log "Dispatching reproduce-agent for issue #${issue_number} (project: ${project_toml})"
log "Dispatching reproduce-agent for issue #${issue_number} (project: ${project_toml})" >> "$DISPATCHER_LOG_FILE"
# Build docker run command using array (safe from injection)
local -a cmd=(docker run --rm
@@ -575,7 +580,7 @@ dispatch_reproduce() {
"${cmd[@]}" &
local bg_pid=$!
echo "$bg_pid" > "$(_reproduce_lockfile "$issue_number")"
log "Reproduce container launched (pid ${bg_pid}) for issue #${issue_number}"
log "Reproduce container launched (pid ${bg_pid}) for issue #${issue_number}" >> "$DISPATCHER_LOG_FILE"
}
# -----------------------------------------------------------------------------
@@ -636,7 +641,7 @@ dispatch_triage() {
local issue_number="$1"
if is_triage_running "$issue_number"; then
log "Triage already running for issue #${issue_number}, skipping"
log "Triage already running for issue #${issue_number}, skipping" >> "$DISPATCHER_LOG_FILE"
return 0
fi
@@ -647,7 +652,7 @@ dispatch_triage() {
done
if [ -z "$project_toml" ]; then
log "WARNING: no project TOML found under ${FACTORY_ROOT}/projects/ — skipping triage for #${issue_number}"
log "WARNING: no project TOML found under ${FACTORY_ROOT}/projects/ — skipping triage for #${issue_number}" >> "$DISPATCHER_LOG_FILE"
return 0
fi
@@ -700,7 +705,7 @@ dispatch_triage() {
"${cmd[@]}" &
local bg_pid=$!
echo "$bg_pid" > "$(_triage_lockfile "$issue_number")"
log "Triage container launched (pid ${bg_pid}) for issue #${issue_number}"
log "Triage container launched (pid ${bg_pid}) for issue #${issue_number}" >> "$DISPATCHER_LOG_FILE"
}
# -----------------------------------------------------------------------------
@@ -710,19 +715,19 @@ dispatch_triage() {
# Clone or pull the ops repo
ensure_ops_repo() {
if [ ! -d "${OPS_REPO_ROOT}/.git" ]; then
log "Cloning ops repo from ${FORGE_URL}/${FORGE_OPS_REPO}..."
log "Cloning ops repo from ${FORGE_URL}/${FORGE_OPS_REPO}..." >> "$DISPATCHER_LOG_FILE"
git clone "${FORGE_URL}/${FORGE_OPS_REPO}" "${OPS_REPO_ROOT}"
else
log "Pulling latest ops repo changes..."
log "Pulling latest ops repo changes..." >> "$DISPATCHER_LOG_FILE"
(cd "${OPS_REPO_ROOT}" && git pull --rebase)
fi
}
# Main dispatcher loop
main() {
log "Starting dispatcher..."
log "Polling ops repo: ${VAULT_ACTIONS_DIR}"
log "Admin users: ${ADMIN_USERS}"
log "Starting dispatcher..." >> "$DISPATCHER_LOG_FILE"
log "Polling ops repo: ${VAULT_ACTIONS_DIR}" >> "$DISPATCHER_LOG_FILE"
log "Admin users: ${ADMIN_USERS}" >> "$DISPATCHER_LOG_FILE"
while true; do
# Refresh ops repo at the start of each poll cycle
@@ -730,7 +735,7 @@ main() {
# Check if actions directory exists
if [ ! -d "${VAULT_ACTIONS_DIR}" ]; then
log "Actions directory not found: ${VAULT_ACTIONS_DIR}"
log "Actions directory not found: ${VAULT_ACTIONS_DIR}" >> "$DISPATCHER_LOG_FILE"
sleep 60
continue
fi
@@ -745,7 +750,7 @@ main() {
# Skip if already completed
if is_action_completed "$action_id"; then
log "Action ${action_id} already completed, skipping"
log "Action ${action_id} already completed, skipping" >> "$DISPATCHER_LOG_FILE"
continue
fi

View file

@@ -55,7 +55,8 @@ RESULT_FILE="/tmp/gardener-result-${PROJECT_NAME}.txt"
GARDENER_PR_FILE="/tmp/gardener-pr-${PROJECT_NAME}.txt"
WORKTREE="/tmp/${PROJECT_NAME}-gardener-run"
log() { echo "[$(date -u +%Y-%m-%dT%H:%M:%S)Z] $*" >> "$LOG_FILE"; }
# Override LOG_AGENT for consistent agent identification
LOG_AGENT="gardener"
# ── Guards ────────────────────────────────────────────────────────────────
check_active gardener
@@ -156,19 +157,21 @@ _gardener_execute_manifest() {
case "$action" in
add_label)
local label label_id
local label label_id http_code resp
label=$(jq -r ".[$i].label" "$manifest_file")
label_id=$(curl -sf -H "Authorization: token ${FORGE_TOKEN}" \
"${FORGE_API}/labels" | jq -r --arg n "$label" \
'.[] | select(.name == $n) | .id') || true
if [ -n "$label_id" ]; then
if curl -sf -X POST -H "Authorization: token ${FORGE_TOKEN}" \
resp=$(curl -sf -w "\n%{http_code}" -X POST -H "Authorization: token ${FORGE_TOKEN}" \
-H 'Content-Type: application/json' \
"${FORGE_API}/issues/${issue}/labels" \
-d "{\"labels\":[${label_id}]}" >/dev/null 2>&1; then
-d "{\"labels\":[${label_id}]}" 2>/dev/null) || true
http_code=$(echo "$resp" | tail -1)
if [ "$http_code" = "200" ] || [ "$http_code" = "201" ]; then
log "manifest: add_label '${label}' to #${issue}"
else
log "manifest: FAILED add_label '${label}' to #${issue}"
log "manifest: FAILED add_label '${label}' to #${issue}: HTTP ${http_code}"
fi
else
log "manifest: FAILED add_label — label '${label}' not found"
@@ -176,17 +179,19 @@ _gardener_execute_manifest() {
;;
remove_label)
local label label_id
local label label_id http_code resp
label=$(jq -r ".[$i].label" "$manifest_file")
label_id=$(curl -sf -H "Authorization: token ${FORGE_TOKEN}" \
"${FORGE_API}/labels" | jq -r --arg n "$label" \
'.[] | select(.name == $n) | .id') || true
if [ -n "$label_id" ]; then
if curl -sf -X DELETE -H "Authorization: token ${FORGE_TOKEN}" \
"${FORGE_API}/issues/${issue}/labels/${label_id}" >/dev/null 2>&1; then
resp=$(curl -sf -w "\n%{http_code}" -X DELETE -H "Authorization: token ${FORGE_TOKEN}" \
"${FORGE_API}/issues/${issue}/labels/${label_id}" 2>/dev/null) || true
http_code=$(echo "$resp" | tail -1)
if [ "$http_code" = "200" ] || [ "$http_code" = "204" ]; then
log "manifest: remove_label '${label}' from #${issue}"
else
log "manifest: FAILED remove_label '${label}' from #${issue}"
log "manifest: FAILED remove_label '${label}' from #${issue}: HTTP ${http_code}"
fi
else
log "manifest: FAILED remove_label — label '${label}' not found"
@@ -194,34 +199,38 @@ _gardener_execute_manifest() {
;;
close)
local reason
local reason http_code resp
reason=$(jq -r ".[$i].reason // empty" "$manifest_file")
if curl -sf -X PATCH -H "Authorization: token ${FORGE_TOKEN}" \
resp=$(curl -sf -w "\n%{http_code}" -X PATCH -H "Authorization: token ${FORGE_TOKEN}" \
-H 'Content-Type: application/json' \
"${FORGE_API}/issues/${issue}" \
-d '{"state":"closed"}' >/dev/null 2>&1; then
-d '{"state":"closed"}' 2>/dev/null) || true
http_code=$(echo "$resp" | tail -1)
if [ "$http_code" = "200" ] || [ "$http_code" = "204" ]; then
log "manifest: closed #${issue} (${reason})"
else
log "manifest: FAILED close #${issue}"
log "manifest: FAILED close #${issue}: HTTP ${http_code}"
fi
;;
comment)
local body escaped_body
local body escaped_body http_code resp
body=$(jq -r ".[$i].body" "$manifest_file")
escaped_body=$(printf '%s' "$body" | jq -Rs '.')
if curl -sf -X POST -H "Authorization: token ${FORGE_TOKEN}" \
resp=$(curl -sf -w "\n%{http_code}" -X POST -H "Authorization: token ${FORGE_TOKEN}" \
-H 'Content-Type: application/json' \
"${FORGE_API}/issues/${issue}/comments" \
-d "{\"body\":${escaped_body}}" >/dev/null 2>&1; then
-d "{\"body\":${escaped_body}}" 2>/dev/null) || true
http_code=$(echo "$resp" | tail -1)
if [ "$http_code" = "200" ] || [ "$http_code" = "201" ]; then
log "manifest: commented on #${issue}"
else
log "manifest: FAILED comment on #${issue}"
log "manifest: FAILED comment on #${issue}: HTTP ${http_code}"
fi
;;
create_issue)
local title body labels escaped_title escaped_body label_ids
local title body labels escaped_title escaped_body label_ids http_code resp
title=$(jq -r ".[$i].title" "$manifest_file")
body=$(jq -r ".[$i].body" "$manifest_file")
labels=$(jq -r ".[$i].labels // [] | .[]" "$manifest_file")
@@ -241,40 +250,46 @@ _gardener_execute_manifest() {
done <<< "$labels"
[ -n "$ids_json" ] && label_ids="[${ids_json}]"
fi
if curl -sf -X POST -H "Authorization: token ${FORGE_TOKEN}" \
resp=$(curl -sf -w "\n%{http_code}" -X POST -H "Authorization: token ${FORGE_TOKEN}" \
-H 'Content-Type: application/json' \
"${FORGE_API}/issues" \
-d "{\"title\":${escaped_title},\"body\":${escaped_body},\"labels\":${label_ids}}" >/dev/null 2>&1; then
-d "{\"title\":${escaped_title},\"body\":${escaped_body},\"labels\":${label_ids}}" 2>/dev/null) || true
http_code=$(echo "$resp" | tail -1)
if [ "$http_code" = "200" ] || [ "$http_code" = "201" ]; then
log "manifest: created issue '${title}'"
else
log "manifest: FAILED create_issue '${title}'"
log "manifest: FAILED create_issue '${title}': HTTP ${http_code}"
fi
;;
edit_body)
local body escaped_body
local body escaped_body http_code resp
body=$(jq -r ".[$i].body" "$manifest_file")
escaped_body=$(printf '%s' "$body" | jq -Rs '.')
if curl -sf -X PATCH -H "Authorization: token ${FORGE_TOKEN}" \
resp=$(curl -sf -w "\n%{http_code}" -X PATCH -H "Authorization: token ${FORGE_TOKEN}" \
-H 'Content-Type: application/json' \
"${FORGE_API}/issues/${issue}" \
-d "{\"body\":${escaped_body}}" >/dev/null 2>&1; then
-d "{\"body\":${escaped_body}}" 2>/dev/null) || true
http_code=$(echo "$resp" | tail -1)
if [ "$http_code" = "200" ] || [ "$http_code" = "204" ]; then
log "manifest: edited body of #${issue}"
else
log "manifest: FAILED edit_body #${issue}"
log "manifest: FAILED edit_body #${issue}: HTTP ${http_code}"
fi
;;
close_pr)
local pr
local pr http_code resp
pr=$(jq -r ".[$i].pr" "$manifest_file")
if curl -sf -X PATCH -H "Authorization: token ${FORGE_TOKEN}" \
resp=$(curl -sf -w "\n%{http_code}" -X PATCH -H "Authorization: token ${FORGE_TOKEN}" \
-H 'Content-Type: application/json' \
"${FORGE_API}/pulls/${pr}" \
-d '{"state":"closed"}' >/dev/null 2>&1; then
-d '{"state":"closed"}' 2>/dev/null) || true
http_code=$(echo "$resp" | tail -1)
if [ "$http_code" = "200" ] || [ "$http_code" = "204" ]; then
log "manifest: closed PR #${pr}"
else
log "manifest: FAILED close_pr #${pr}"
log "manifest: FAILED close_pr #${pr}: HTTP ${http_code}"
fi
;;

View file

@@ -52,9 +52,11 @@ agent_run() {
log "agent_run: starting (resume=${resume_id:-(new)}, dir=${run_dir})"
output=$(cd "$run_dir" && flock -w 600 "$lock_file" timeout "${CLAUDE_TIMEOUT:-7200}" claude "${args[@]}" 2>>"$LOGFILE") && rc=0 || rc=$?
if [ "$rc" -eq 124 ]; then
log "agent_run: timeout after ${CLAUDE_TIMEOUT:-7200}s"
log "agent_run: TIMEOUT after ${CLAUDE_TIMEOUT:-7200}s (exit code $rc)"
elif [ "$rc" -ne 0 ]; then
log "agent_run: claude exited with code $rc"
local tail_lines
tail_lines=$(echo "$output" | tail -3)
log "agent_run: FAILED with exit code $rc: ${tail_lines:-no output}"
fi
if [ -z "$output" ]; then
log "agent_run: empty output (claude may have crashed or failed)"
@@ -89,9 +91,11 @@ agent_run() {
local nudge_rc
output=$(cd "$run_dir" && flock -w 600 "$lock_file" timeout "${CLAUDE_TIMEOUT:-7200}" claude -p "$nudge" --resume "$_AGENT_SESSION_ID" --output-format json --dangerously-skip-permissions --max-turns 50 ${CLAUDE_MODEL:+--model "$CLAUDE_MODEL"} 2>>"$LOGFILE") && nudge_rc=0 || nudge_rc=$?
if [ "$nudge_rc" -eq 124 ]; then
log "agent_run: nudge timeout after ${CLAUDE_TIMEOUT:-7200}s"
log "agent_run: nudge TIMEOUT after ${CLAUDE_TIMEOUT:-7200}s (exit code $nudge_rc)"
elif [ "$nudge_rc" -ne 0 ]; then
log "agent_run: nudge claude exited with code $nudge_rc"
local nudge_tail
nudge_tail=$(echo "$output" | tail -3)
log "agent_run: nudge FAILED with exit code $nudge_rc: ${nudge_tail:-no output}"
fi
new_sid=$(printf '%s' "$output" | jq -r '.session_id // empty' 2>/dev/null) || true
if [ -n "$new_sid" ]; then

View file

@@ -138,8 +138,12 @@ unset CLAWHUB_TOKEN 2>/dev/null || true
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
# Shared log helper
# Usage: log "message"
# Sets LOG_AGENT to control the agent name prefix (default: derived from SCRIPT_DIR)
# Output format: [2026-04-03T14:00:00Z] agent: message
log() {
printf '[%s] %s\n' "$(date -u '+%Y-%m-%d %H:%M:%S UTC')" "$*"
local agent="${LOG_AGENT:-$(basename "$(dirname "$(dirname "${BASH_SOURCE[0]}")")")}"
printf '[%s] %s: %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "$agent" "$*"
}
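For reference, a self-contained sketch of how the standardized format renders (the helper is reproduced from the hunk above; the agent name and message are examples, and the timestamp will differ at runtime):

```shell
# Reproduction of the shared log helper with LOG_AGENT set explicitly.
LOG_AGENT="planner"
log() {
  local agent="${LOG_AGENT:-unknown}"
  printf '[%s] %s: %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "$agent" "$*"
}
log "picked up issue #12"
# emits a line like: [2026-04-07T18:51:34Z] planner: picked up issue #12
```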
# =============================================================================

View file

@@ -357,11 +357,19 @@ pr_close() {
local pr_num="$1"
_prl_log "closing PR #${pr_num}"
curl -sf -X PATCH \
local resp http_code
resp=$(curl -sf -w "\n%{http_code}" -X PATCH \
-H "Authorization: token ${FORGE_TOKEN}" \
-H "Content-Type: application/json" \
"${FORGE_API}/pulls/${pr_num}" \
-d '{"state":"closed"}' >/dev/null 2>&1 || true
-d '{"state":"closed"}' 2>/dev/null) || true
http_code=$(printf '%s\n' "$resp" | tail -1)
if [ "$http_code" != "200" ] && [ "$http_code" != "204" ]; then
local body
body=$(printf '%s\n' "$resp" | sed '$d' | head -1)
_prl_log "pr_close FAILED for PR #${pr_num}: HTTP ${http_code} ${body:0:200}"
return 1
fi
}
# ---------------------------------------------------------------------------
@@ -398,11 +406,17 @@ pr_walk_to_merge() {
if [ "${_PR_CI_FAILURE_TYPE:-}" = "infra" ] && [ "$ci_retry_count" -lt 1 ]; then
ci_retry_count=$((ci_retry_count + 1))
_prl_log "infra failure — retriggering CI (retry ${ci_retry_count})"
( cd "$worktree" && \
local rebase_output rebase_rc
rebase_output=$( ( cd "$worktree" && \
git commit --allow-empty -m "ci: retrigger after infra failure" --no-verify && \
git fetch "$remote" "${PRIMARY_BRANCH}" 2>/dev/null && \
git rebase "${remote}/${PRIMARY_BRANCH}" && \
git push --force-with-lease "$remote" HEAD ) 2>&1 | tail -5 || true
git push --force-with-lease "$remote" HEAD ) 2>&1 ) || rebase_rc=$?
if [ -n "$rebase_rc" ] && [ "$rebase_rc" -ne 0 ]; then
_prl_log "infra retrigger FAILED (exit code $rebase_rc): $(echo "$rebase_output" | tail -3)"
else
_prl_log "infra retrigger succeeded"
fi
continue
fi

View file

@@ -35,7 +35,7 @@ source "$FACTORY_ROOT/lib/guard.sh"
# shellcheck source=../lib/agent-sdk.sh
source "$FACTORY_ROOT/lib/agent-sdk.sh"
LOG_FILE="$SCRIPT_DIR/planner.log"
LOG_FILE="${DISINTO_LOG_DIR}/planner/planner.log"
# shellcheck disable=SC2034 # consumed by agent-sdk.sh
LOGFILE="$LOG_FILE"
# shellcheck disable=SC2034 # consumed by agent-sdk.sh
@@ -43,7 +43,8 @@ SID_FILE="/tmp/planner-session-${PROJECT_NAME}.sid"
SCRATCH_FILE="/tmp/planner-${PROJECT_NAME}-scratch.md"
WORKTREE="/tmp/${PROJECT_NAME}-planner-run"
log() { echo "[$(date -u +%Y-%m-%dT%H:%M:%S)Z] $*" >> "$LOG_FILE"; }
# Override LOG_AGENT for consistent agent identification
LOG_AGENT="planner"
# ── Guards ────────────────────────────────────────────────────────────────
check_active planner

View file

@@ -36,7 +36,7 @@ source "$FACTORY_ROOT/lib/guard.sh"
# shellcheck source=../lib/agent-sdk.sh
source "$FACTORY_ROOT/lib/agent-sdk.sh"
LOG_FILE="$SCRIPT_DIR/predictor.log"
LOG_FILE="${DISINTO_LOG_DIR}/predictor/predictor.log"
# shellcheck disable=SC2034 # consumed by agent-sdk.sh
LOGFILE="$LOG_FILE"
# shellcheck disable=SC2034 # consumed by agent-sdk.sh
@@ -44,7 +44,8 @@ SID_FILE="/tmp/predictor-session-${PROJECT_NAME}.sid"
SCRATCH_FILE="/tmp/predictor-${PROJECT_NAME}-scratch.md"
WORKTREE="/tmp/${PROJECT_NAME}-predictor-run"
log() { echo "[$(date -u +%Y-%m-%dT%H:%M:%S)Z] $*" >> "$LOG_FILE"; }
# Override LOG_AGENT for consistent agent identification
LOG_AGENT="predictor"
# ── Guards ────────────────────────────────────────────────────────────────
check_active predictor

View file

@@ -23,20 +23,19 @@ LOGFILE="${DISINTO_LOG_DIR}/review/review-poll.log"
MAX_REVIEWS=3
REVIEW_IDLE_TIMEOUT=14400 # 4h: kill review session if idle
log() {
printf '[%s] %s\n' "$(date -u '+%Y-%m-%d %H:%M:%S UTC')" "$*" >> "$LOGFILE"
}
# Override LOG_AGENT for consistent agent identification
LOG_AGENT="review"
# Log rotation
if [ -f "$LOGFILE" ]; then
LOGSIZE=$(stat -c%s "$LOGFILE" 2>/dev/null || echo 0)
if [ "$LOGSIZE" -gt 102400 ]; then
mv "$LOGFILE" "$LOGFILE.old"
log "Log rotated"
log "Log rotated" >> "$LOGFILE"
fi
fi
log "--- Poll start ---"
log "--- Poll start ---" >> "$LOGFILE"
# --- Clean up stale review sessions (.sid files + worktrees) ---
# Remove .sid files, phase files, and worktrees for merged/closed PRs or idle > 4h
@@ -54,7 +53,7 @@ if [ -n "$REVIEW_SIDS" ]; then
"${API_BASE}/pulls/${pr_num}" | jq -r '.state // "unknown"' 2>/dev/null) || true
if [ "$pr_state" != "open" ]; then
log "cleanup: PR #${pr_num} state=${pr_state} — removing sid/worktree"
log "cleanup: PR #${pr_num} state=${pr_state} — removing sid/worktree" >> "$LOGFILE"
rm -f "$sid_file" "$phase_file" "/tmp/${PROJECT_NAME}-review-output-${pr_num}.json"
cd "$REPO_ROOT"
git worktree remove "$worktree" --force 2>/dev/null || true
@@ -66,7 +65,7 @@ if [ -n "$REVIEW_SIDS" ]; then
sid_mtime=$(stat -c %Y "$sid_file" 2>/dev/null || echo 0)
now=$(date +%s)
if [ "$sid_mtime" -gt 0 ] && [ $(( now - sid_mtime )) -gt "$REVIEW_IDLE_TIMEOUT" ]; then
log "cleanup: PR #${pr_num} idle > 4h — removing sid/worktree"
log "cleanup: PR #${pr_num} idle > 4h — removing sid/worktree" >> "$LOGFILE"
rm -f "$sid_file" "$phase_file" "/tmp/${PROJECT_NAME}-review-output-${pr_num}.json"
cd "$REPO_ROOT"
git worktree remove "$worktree" --force 2>/dev/null || true
@@ -86,7 +85,7 @@ if [ -z "$PRS" ]; then
fi
TOTAL=$(echo "$PRS" | wc -l)
log "Found ${TOTAL} open PRs"
log "Found ${TOTAL} open PRs" >> "$LOGFILE"
REVIEWED=0
SKIPPED=0
@@ -119,17 +118,19 @@ if [ -n "$REVIEW_SIDS" ]; then
if ! ci_passed "$ci_state"; then
if ci_required_for_pr "$pr_num"; then
log " #${pr_num} awaiting_changes: new SHA ${current_sha:0:7} CI=${ci_state}, waiting"
log " #${pr_num} awaiting_changes: new SHA ${current_sha:0:7} CI=${ci_state}, waiting" >> "$LOGFILE"
continue
fi
fi
log " #${pr_num} re-review: new commits (${reviewed_sha:0:7} → ${current_sha:0:7})"
log " #${pr_num} re-review: new commits (${reviewed_sha:0:7} → ${current_sha:0:7})" >> "$LOGFILE"
if "${SCRIPT_DIR}/review-pr.sh" "$pr_num" 2>&1; then
review_output=$("${SCRIPT_DIR}/review-pr.sh" "$pr_num" 2>&1) && review_rc=0 || review_rc=$?
if [ "$review_rc" -eq 0 ]; then
REVIEWED=$((REVIEWED + 1))
else
log " #${pr_num} re-review failed"
tail_lines=$(echo "$review_output" | tail -3)
log " #${pr_num} re-review FAILED (exit code $review_rc): ${tail_lines}"
fi
[ "$REVIEWED" -lt "$MAX_REVIEWS" ] || break
@@ -145,11 +146,11 @@ while IFS= read -r line; do
# Skip if CI is running/failed. Allow "success", no CI configured, or non-code PRs
if ! ci_passed "$CI_STATE"; then
if ci_required_for_pr "$PR_NUM"; then
log " #${PR_NUM} CI=${CI_STATE}, skip"
log " #${PR_NUM} CI=${CI_STATE}, skip" >> "$LOGFILE"
SKIPPED=$((SKIPPED + 1))
continue
fi
log " #${PR_NUM} CI=${CI_STATE} but no code files — proceeding"
log " #${PR_NUM} CI=${CI_STATE} but no code files — proceeding" >> "$LOGFILE"
fi
# Check formal forge reviews (not comment markers)
@@ -159,12 +160,12 @@ while IFS= read -r line; do
'[.[] | select(.commit_id == $sha) | select(.state != "COMMENT")] | length')
if [ "${HAS_REVIEW:-0}" -gt "0" ]; then
log " #${PR_NUM} formal review exists for ${PR_SHA:0:7}, skip"
log " #${PR_NUM} formal review exists for ${PR_SHA:0:7}, skip" >> "$LOGFILE"
SKIPPED=$((SKIPPED + 1))
continue
fi
log " #${PR_NUM} needs review (CI=success, SHA=${PR_SHA:0:7})"
log " #${PR_NUM} needs review (CI=success, SHA=${PR_SHA:0:7})" >> "$LOGFILE"
# Circuit breaker: count existing review-error comments for this SHA
ERROR_COMMENTS=$(curl -sf -H "Authorization: token ${FORGE_TOKEN}" \
@@ -173,21 +174,23 @@ while IFS= read -r line; do
'[.[] | select(.body | contains("<!-- review-error: " + $sha + " -->"))] | length')
if [ "${ERROR_COMMENTS:-0}" -ge 3 ]; then
log " #${PR_NUM} blocked: ${ERROR_COMMENTS} consecutive error comments for ${PR_SHA:0:7}, skipping"
log " #${PR_NUM} blocked: ${ERROR_COMMENTS} consecutive error comments for ${PR_SHA:0:7}, skipping" >> "$LOGFILE"
SKIPPED=$((SKIPPED + 1))
continue
fi
log " #${PR_NUM} error check: ${ERROR_COMMENTS:-0} prior error(s) for ${PR_SHA:0:7}"
log " #${PR_NUM} error check: ${ERROR_COMMENTS:-0} prior error(s) for ${PR_SHA:0:7}" >> "$LOGFILE"
if "${SCRIPT_DIR}/review-pr.sh" "$PR_NUM" 2>&1; then
review_output=$("${SCRIPT_DIR}/review-pr.sh" "$PR_NUM" 2>&1) && review_rc=0 || review_rc=$?
if [ "$review_rc" -eq 0 ]; then
REVIEWED=$((REVIEWED + 1))
else
log " #${PR_NUM} review failed"
tail_lines=$(echo "$review_output" | tail -3)
log " #${PR_NUM} review FAILED (exit code $review_rc): ${tail_lines}"
fi
if [ "$REVIEWED" -ge "$MAX_REVIEWS" ]; then
log "Hit max reviews (${MAX_REVIEWS}), stopping"
log "Hit max reviews (${MAX_REVIEWS}), stopping" >> "$LOGFILE"
break
fi
@@ -195,4 +198,4 @@ while IFS= read -r line; do
done <<< "$PRS"
log "--- Poll done: ${REVIEWED} reviewed, ${SKIPPED} skipped ---"
log "--- Poll done: ${REVIEWED} reviewed, ${SKIPPED} skipped ---" >> "$LOGFILE"

View file

@@ -46,7 +46,8 @@ SID_FILE="/tmp/supervisor-session-${PROJECT_NAME}.sid"
SCRATCH_FILE="/tmp/supervisor-${PROJECT_NAME}-scratch.md"
WORKTREE="/tmp/${PROJECT_NAME}-supervisor-run"
log() { echo "[$(date -u +%Y-%m-%dT%H:%M:%S)Z] $*" >> "$LOG_FILE"; }
# Override LOG_AGENT for consistent agent identification
LOG_AGENT="supervisor"
# ── Guards ────────────────────────────────────────────────────────────────
check_active supervisor
@@ -67,10 +68,12 @@ resolve_agent_identity || true
# ── Collect pre-flight metrics ────────────────────────────────────────────
log "Running preflight.sh"
PREFLIGHT_OUTPUT=""
PREFLIGHT_RC=0
if PREFLIGHT_OUTPUT=$(bash "$SCRIPT_DIR/preflight.sh" "$PROJECT_TOML" 2>&1); then
log "Preflight collected ($(echo "$PREFLIGHT_OUTPUT" | wc -l) lines)"
else
log "WARNING: preflight.sh failed, continuing with partial data"
PREFLIGHT_RC=$?
log "WARNING: preflight.sh failed (exit code $PREFLIGHT_RC), continuing with partial data"
fi
# ── Load formula + context ───────────────────────────────────────────────