Claude 810d92676c

ci/woodpecker/push/ci Pipeline was successful

Details

ci/woodpecker/pr/ci Pipeline was successful

Details

fix: extend step 8 approval-bias carve-out to include infra files (step 3c), fix count

- Step 8 now explicitly exempts infrastructure file findings (step 3c) from
  the "bias toward APPROVE" guidance, preventing the original failure mode
- Fix investigation summary: "Five" → "Six" structural gaps

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-11 19:50:59 +00:00

5.7 KiB

Raw Blame History

Investigation: Reviewer approved destructive compose rewrite in PR #683

Issue: #685 Date: 2026-04-11 PR under investigation: #683 (fix: config: gardener=1h, architect=9m, planner=11m)

Summary

The reviewer agent approved PR #683 in ~1 minute without flagging that it contained a destructive rewrite of docker-compose.yml — dropping named volumes, bind mounts, env vars, restart policy, and security options. Six structural gaps in the review pipeline allowed this to pass.

Root causes

1. No infrastructure-file-specific review checklist

The review formula (formulas/review-pr.toml) has a generic review checklist (bugs, security, imports, architecture, bash specifics, dead code). It has no special handling for infrastructure files — docker-compose.yml, Dockerfile, CI configs, or entrypoint.sh are reviewed with the same checklist as application code.

Infrastructure files have a different failure mode: a single dropped line (a volume mount, an env var, a restart policy) can break a running deployment without any syntax error or linting failure. The generic checklist doesn't prompt the reviewer to check for these regressions.

Fix applied: Added step 3c "Infrastructure file review" to formulas/review-pr.toml with a compose-specific checklist covering named volumes, bind mounts, env vars, restart policy, and security options.

2. No scope discipline

Issue #682 asked for ~3 env var changes + PLANNER_INTERVAL plumbing — roughly 10-15 lines across 3-4 files. PR #683's diff rewrote the entire compose service block (~50+ lines changed in docker-compose.yml alone).

The review formula does not instruct the reviewer to compare diff size against issue scope. A scope-aware reviewer would flag: "this PR changes more lines than the issue scope warrants — request justification for out-of-scope changes."

Fix applied: Added step 3d "Scope discipline" to formulas/review-pr.toml requiring the reviewer to compare actual changes against stated issue scope and flag out-of-scope modifications to infrastructure files.

3. Lessons-learned bias toward approval

The reviewer's .profile/knowledge/lessons-learned.md contains multiple entries that systematically bias toward approval:

"Approval means 'ready to ship,' not 'perfect.'"
"'Different from how I'd write it' is not a blocker."
"Reserve request_changes for genuinely blocking concerns."

These lessons are well-intentioned (they prevent nit-picking and false blocks) but they create a blind spot: the reviewer suppresses its instinct to flag suspicious-looking changes because the lessons tell it not to block on "taste-based" concerns. A compose service block rewrite looks like a style preference ("the dev reorganized the file") but is actually a correctness regression.

Recommendation: The lessons-learned are not wrong — they should stay. But the review formula now explicitly carves out infrastructure files from the "bias toward APPROVE" guidance, making it clear that dropped infra configuration is a blocking concern, not a style preference.

4. No ground-truth for infrastructure files

The reviewer only sees the diff. It has no way to compare against the running container's actual volume/env config. When dev-qwen rewrote a 30-line service block from scratch, the reviewer saw a 30-line addition and a 30-line deletion with no reference point.

Recommendation (future work): Maintain a docker/expected-compose-config.yml or have the reviewer fetch docker compose config output as ground truth when reviewing compose changes. This would let the reviewer diff the proposed config against the known-good config.

5. Structural analysis blind spot

lib/build-graph.py tracks changes to files in formulas/, agent directories (dev/, review/, etc.), and evidence/. It does not track infrastructure files (docker-compose.yml, docker/, .woodpecker/). Changes to these files produce no alerts in the graph report — the reviewer gets no "affected objectives" signal for infrastructure changes.

Recommendation (future work): Add infrastructure file tracking to build-graph.py so that compose/Dockerfile/CI changes surface in the structural analysis.

6. Model and time budget

Reviews use Sonnet (CLAUDE_MODEL="sonnet" at review-pr.sh:229) with a 15-minute timeout. The PR #683 review completed in ~1 minute. Sonnet is optimized for speed, which is appropriate for most code reviews, but infrastructure changes benefit from the deeper reasoning of a more capable model.

Recommendation (future work): Consider escalating to a more capable model when the diff includes infrastructure files (compose, Dockerfiles, CI configs).

Changes made

formulas/review-pr.toml — Added two new review steps:
- Step 3c: Infrastructure file review — When the diff touches docker-compose.yml, Dockerfile*, .woodpecker/, or docker/, requires checking for dropped volumes, bind mounts, env vars, restart policy, security options, and network config. Instructs the reviewer to read the full file (not just the diff) and compare against the base branch.
- Step 3d: Scope discipline — Requires comparing the actual diff footprint against the stated issue scope. Flags out-of-scope rewrites of infrastructure files as blocking concerns.

What would have caught this

With the changes above, the reviewer would have:

Seen step 3c trigger for docker-compose.yml changes
Read the full compose file and compared against the base branch
Noticed the dropped named volumes, bind mounts, env vars, restart policy
Seen step 3d flag that a 3-env-var issue produced a 50+ line compose rewrite
Issued REQUEST_CHANGES citing specific dropped configuration

5.7 KiB Raw Blame History