investigation: reviewer agent approved destructive compose rewrite in PR #683 — why? #685

New issue

Closed

opened 2026-04-11 19:37:24 +00:00 by dev-bot · 0 comments

dev-bot commented

2026-04-11 19:37:24 +00:00

Collaborator

Description

Investigation: PR #683 was reviewed and APPROVED by the reviewer agent (Claude in disinto-agents) at 2026-04-11 17:59:57 UTC, despite containing a destructive rewrite of docker-compose.yml that would break the production deployment if anyone ran docker compose up -d (see follow-up issue for the compose-file regression). The reviewer should have caught this. Why didn't it?

Reviewer outcome to explain

From /home/agent/data/logs/review/review.log:

[2026-04-11 17:59:56 UTC] PR#683 verdict: APPROVE
[2026-04-11 17:59:57 UTC] PR#683 posting review
[2026-04-11 17:59:57 UTC] PR#683 posted review comment
[2026-04-11 17:59:57 UTC] PR#683 formal APPROVED submitted
[2026-04-11 18:00:10 UTC] PR#683 DONE: APPROVE (re-review: false)

The reviewer Claude session ran for ~1 minute and approved without flagging that the compose-file changes:

Removed restart: unless-stopped and security_opt: apparmor=unconfined
Replaced production named volumes (agent-data, project-repos) with bind mounts to nonexistent paths
Dropped 6+ critical bind mounts (CLAUDE_SHARED_DIR, .claude.json, .ssh, sops age key, woodpecker-data, the versioned claude binary)
Dropped FORGE_PASS, FORGE_PASS_LLAMA, FORGE_TOKEN_LLAMA, WOODPECKER_REPO_ID, FORGE_BOT_USERNAMES
Added a dead DISINTO_AGENTS=review,gardener env var the entrypoint doesn't read

Each of these is a regression that a careful reviewer with context about the running deployment would have flagged.

Things to investigate

1. Did the reviewer see the full diff?

Check /tmp/project-review-graph-681.json-style artifacts for PR #683 — does the graph report include the docker-compose.yml hunks, or did it summarize/elide them?
Check review/review-pr.sh for how the diff is fetched and presented to Claude. Is there a size cap or truncation? Does it pass the full unified diff or a summary?

2. Did the reviewer have ground-truth context?

The reviewer doesn't know what the running container's actual volume/env config is. It only sees the diff. So if dev-qwen rewrote a 30-line block from scratch and the previous block was different, the reviewer sees a 30-line addition and a 30-line deletion — and has no way to compare against docker inspect for truth.
Suggested mitigation: have the reviewer fetch docker inspect for the affected service when reviewing changes to docker-compose.yml, or consult an ops/expected-config.json file as ground truth.

3. What's in the reviewer's prompt / lessons-learned?

Reviewer's dev-bot/.profile/lessons-learned.md is loaded each run (we saw loaded lessons-learned.md (2048 bytes) in the log). What's in there?
Is there guidance like "for compose changes, verify named volumes / restart policy / security_opt aren't dropped"?
If not, this might be the simplest fix: add a compose-file-specific checklist to the reviewer's profile.

4. Did the reviewer apply scope discipline?

Issue #682 asked for: add three env vars + add PLANNER_INTERVAL plumbing. That's roughly: ~3 lines in compose, ~1 line in entrypoint.sh, ~1 line in lib/generators.sh, ~5 lines of doc. Total well under 20 lines.
PR #683's actual diff almost certainly touches >50 lines in docker-compose.yml alone (a service block rewrite).
A scope-aware reviewer should flag: "this PR changes more lines than the issue scope warrants — request justification for out-of-scope changes."
Does review/review-pr.sh (or its prompt) compare diff size against issue scope? If not, that's a structural gap.

5. Was the reviewer rushing?

Session duration: 1 minute (17:58:38 → 17:59:57). For a multi-file PR with non-trivial compose changes, that's fast. Token budget? Turn budget?
Check agent_run log for #683 review session — how many turns, what tools used, what context window utilization?

Suggested investigation steps

Read review/review-pr.sh to understand what gets passed to Claude (full diff vs summary, any caps)
Read /home/agent/data/.profile/dev-bot/lessons-learned.md to see existing reviewer guidance
Find the saved Claude session for the #683 review (probably under ~/.claude/projects/<hash>/<sid>.jsonl) and read what the reviewer actually said and what tools it used
Check review/review.log for the full session summary including turn count and tool calls

Output

A short writeup of root cause + at least one concrete improvement to either:

the reviewer's prompt template
the reviewer's lessons-learned
the diff-presentation logic in review-pr.sh
a new pre-merge guard (e.g., a CI step that diffs docker compose config against the running container's docker inspect and fails on incompatible changes)

Why this matters

The factory's whole self-development premise depends on reviewers catching dev-agent mistakes. Today's incident demonstrates that the reviewer has at least one major class of blind spot — out-of-scope rewrites of infrastructure files. This isn't a one-off; the next time dev-qwen "improves" the compose file or a Dockerfile or a CI config, it'll happen again unless the reviewer learns to defend against it.

## Description Investigation: PR #683 was reviewed and **APPROVED** by the reviewer agent (Claude in `disinto-agents`) at 2026-04-11 17:59:57 UTC, despite containing a destructive rewrite of `docker-compose.yml` that would break the production deployment if anyone ran `docker compose up -d` (see follow-up issue for the compose-file regression). The reviewer should have caught this. Why didn't it? ## Reviewer outcome to explain From `/home/agent/data/logs/review/review.log`: ``` [2026-04-11 17:59:56 UTC] PR#683 verdict: APPROVE [2026-04-11 17:59:57 UTC] PR#683 posting review [2026-04-11 17:59:57 UTC] PR#683 posted review comment [2026-04-11 17:59:57 UTC] PR#683 formal APPROVED submitted [2026-04-11 18:00:10 UTC] PR#683 DONE: APPROVE (re-review: false) ``` The reviewer Claude session ran for ~1 minute and approved without flagging that the compose-file changes: 1. Removed `restart: unless-stopped` and `security_opt: apparmor=unconfined` 2. Replaced production named volumes (`agent-data`, `project-repos`) with bind mounts to nonexistent paths 3. Dropped 6+ critical bind mounts (`CLAUDE_SHARED_DIR`, `.claude.json`, `.ssh`, sops age key, woodpecker-data, the versioned claude binary) 4. Dropped `FORGE_PASS`, `FORGE_PASS_LLAMA`, `FORGE_TOKEN_LLAMA`, `WOODPECKER_REPO_ID`, `FORGE_BOT_USERNAMES` 5. Added a dead `DISINTO_AGENTS=review,gardener` env var the entrypoint doesn't read Each of these is a regression that a careful reviewer with context about the running deployment would have flagged. ## Things to investigate ### 1. Did the reviewer see the full diff? - Check `/tmp/project-review-graph-681.json`-style artifacts for PR #683 — does the graph report include the docker-compose.yml hunks, or did it summarize/elide them? - Check `review/review-pr.sh` for how the diff is fetched and presented to Claude. Is there a size cap or truncation? Does it pass the full unified diff or a summary? ### 2. Did the reviewer have ground-truth context? - The reviewer doesn't know what the *running* container's actual volume/env config is. It only sees the diff. So if dev-qwen rewrote a 30-line block from scratch and the previous block was different, the reviewer sees a 30-line addition and a 30-line deletion — and has no way to compare against `docker inspect` for truth. - **Suggested mitigation**: have the reviewer fetch `docker inspect` for the affected service when reviewing changes to `docker-compose.yml`, or consult an `ops/expected-config.json` file as ground truth. ### 3. What's in the reviewer's prompt / lessons-learned? - Reviewer's `dev-bot/.profile/lessons-learned.md` is loaded each run (we saw `loaded lessons-learned.md (2048 bytes)` in the log). What's in there? - Is there guidance like "for compose changes, verify named volumes / restart policy / security_opt aren't dropped"? - If not, this might be the simplest fix: add a compose-file-specific checklist to the reviewer's profile. ### 4. Did the reviewer apply scope discipline? - Issue #682 asked for: add three env vars + add `PLANNER_INTERVAL` plumbing. That's roughly: ~3 lines in compose, ~1 line in entrypoint.sh, ~1 line in lib/generators.sh, ~5 lines of doc. Total well under 20 lines. - PR #683's actual diff almost certainly touches >50 lines in docker-compose.yml alone (a service block rewrite). - **A scope-aware reviewer should flag**: "this PR changes more lines than the issue scope warrants — request justification for out-of-scope changes." - Does `review/review-pr.sh` (or its prompt) compare diff size against issue scope? If not, that's a structural gap. ### 5. Was the reviewer rushing? - Session duration: 1 minute (17:58:38 → 17:59:57). For a multi-file PR with non-trivial compose changes, that's fast. Token budget? Turn budget? - Check `agent_run` log for #683 review session — how many turns, what tools used, what context window utilization? ## Suggested investigation steps 1. Read `review/review-pr.sh` to understand what gets passed to Claude (full diff vs summary, any caps) 2. Read `/home/agent/data/.profile/dev-bot/lessons-learned.md` to see existing reviewer guidance 3. Find the saved Claude session for the #683 review (probably under `~/.claude/projects/<hash>/<sid>.jsonl`) and read what the reviewer actually said and what tools it used 4. Check `review/review.log` for the full session summary including turn count and tool calls ## Output A short writeup of root cause + at least one concrete improvement to either: - the reviewer's prompt template - the reviewer's lessons-learned - the diff-presentation logic in `review-pr.sh` - a new pre-merge guard (e.g., a CI step that diffs `docker compose config` against the running container's `docker inspect` and fails on incompatible changes) ## Why this matters The factory's whole self-development premise depends on reviewers catching dev-agent mistakes. Today's incident demonstrates that the reviewer has at least one major class of blind spot — *out-of-scope rewrites of infrastructure files*. This isn't a one-off; the next time dev-qwen "improves" the compose file or a Dockerfile or a CI config, it'll happen again unless the reviewer learns to defend against it.