- Step 8 now explicitly exempts infrastructure file findings (step 3c) from the "bias toward APPROVE" guidance, preventing the original failure mode - Fix investigation summary: "Five" → "Six" structural gaps Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
5.7 KiB
Investigation: Reviewer approved destructive compose rewrite in PR #683
Issue: #685 Date: 2026-04-11 PR under investigation: #683 (fix: config: gardener=1h, architect=9m, planner=11m)
Summary
The reviewer agent approved PR #683 in ~1 minute without flagging that it
contained a destructive rewrite of docker-compose.yml — dropping named
volumes, bind mounts, env vars, restart policy, and security options. Six
structural gaps in the review pipeline allowed this to pass.
Root causes
1. No infrastructure-file-specific review checklist
The review formula (formulas/review-pr.toml) has a generic review checklist
(bugs, security, imports, architecture, bash specifics, dead code). It has
no special handling for infrastructure files — docker-compose.yml,
Dockerfile, CI configs, or entrypoint.sh are reviewed with the same
checklist as application code.
Infrastructure files have a different failure mode: a single dropped line (a volume mount, an env var, a restart policy) can break a running deployment without any syntax error or linting failure. The generic checklist doesn't prompt the reviewer to check for these regressions.
Fix applied: Added step 3c "Infrastructure file review" to
formulas/review-pr.toml with a compose-specific checklist covering named
volumes, bind mounts, env vars, restart policy, and security options.
2. No scope discipline
Issue #682 asked for ~3 env var changes + PLANNER_INTERVAL plumbing — roughly
10-15 lines across 3-4 files. PR #683's diff rewrote the entire compose service
block (~50+ lines changed in docker-compose.yml alone).
The review formula does not instruct the reviewer to compare diff size against issue scope. A scope-aware reviewer would flag: "this PR changes more lines than the issue scope warrants — request justification for out-of-scope changes."
Fix applied: Added step 3d "Scope discipline" to formulas/review-pr.toml
requiring the reviewer to compare actual changes against stated issue scope and
flag out-of-scope modifications to infrastructure files.
3. Lessons-learned bias toward approval
The reviewer's .profile/knowledge/lessons-learned.md contains multiple entries
that systematically bias toward approval:
- "Approval means 'ready to ship,' not 'perfect.'"
- "'Different from how I'd write it' is not a blocker."
- "Reserve request_changes for genuinely blocking concerns."
These lessons are well-intentioned (they prevent nit-picking and false blocks) but they create a blind spot: the reviewer suppresses its instinct to flag suspicious-looking changes because the lessons tell it not to block on "taste-based" concerns. A compose service block rewrite looks like a style preference ("the dev reorganized the file") but is actually a correctness regression.
Recommendation: The lessons-learned are not wrong — they should stay. But the review formula now explicitly carves out infrastructure files from the "bias toward APPROVE" guidance, making it clear that dropped infra configuration is a blocking concern, not a style preference.
4. No ground-truth for infrastructure files
The reviewer only sees the diff. It has no way to compare against the running container's actual volume/env config. When dev-qwen rewrote a 30-line service block from scratch, the reviewer saw a 30-line addition and a 30-line deletion with no reference point.
Recommendation (future work): Maintain a docker/expected-compose-config.yml
or have the reviewer fetch docker compose config output as ground truth when
reviewing compose changes. This would let the reviewer diff the proposed config
against the known-good config.
5. Structural analysis blind spot
lib/build-graph.py tracks changes to files in formulas/, agent directories
(dev/, review/, etc.), and evidence/. It does not track infrastructure
files (docker-compose.yml, docker/, .woodpecker/). Changes to these
files produce no alerts in the graph report — the reviewer gets no
"affected objectives" signal for infrastructure changes.
Recommendation (future work): Add infrastructure file tracking to
build-graph.py so that compose/Dockerfile/CI changes surface in the
structural analysis.
6. Model and time budget
Reviews use Sonnet (CLAUDE_MODEL="sonnet" at review-pr.sh:229) with a
15-minute timeout. The PR #683 review completed in ~1 minute. Sonnet is
optimized for speed, which is appropriate for most code reviews, but
infrastructure changes benefit from the deeper reasoning of a more capable
model.
Recommendation (future work): Consider escalating to a more capable model when the diff includes infrastructure files (compose, Dockerfiles, CI configs).
Changes made
formulas/review-pr.toml— Added two new review steps:- Step 3c: Infrastructure file review — When the diff touches
docker-compose.yml,Dockerfile*,.woodpecker/, ordocker/, requires checking for dropped volumes, bind mounts, env vars, restart policy, security options, and network config. Instructs the reviewer to read the full file (not just the diff) and compare against the base branch. - Step 3d: Scope discipline — Requires comparing the actual diff footprint against the stated issue scope. Flags out-of-scope rewrites of infrastructure files as blocking concerns.
- Step 3c: Infrastructure file review — When the diff touches
What would have caught this
With the changes above, the reviewer would have:
- Seen step 3c trigger for
docker-compose.ymlchanges - Read the full compose file and compared against the base branch
- Noticed the dropped named volumes, bind mounts, env vars, restart policy
- Seen step 3d flag that a 3-env-var issue produced a 50+ line compose rewrite
- Issued REQUEST_CHANGES citing specific dropped configuration