fix: investigation: reviewer agent approved destructive compose rewrite in PR #683 — why? (#685)
All checks were successful
ci/woodpecker/push/ci Pipeline was successful
ci/woodpecker/pr/ci Pipeline was successful

Root cause: review formula had no infrastructure-file-specific checklist and
no scope discipline check. The reviewer treated a docker-compose.yml rewrite
the same as any code change, and lessons-learned biased toward approval.

Changes:
- Add step 3c (infrastructure file review) to formulas/review-pr.toml:
  compose-specific checklist for volumes, bind mounts, env vars, restart
  policy, security options
- Add step 3d (scope discipline) to formulas/review-pr.toml: compare
  actual diff size against issue scope, block on infra-file scope violations
- Add investigation writeup in docs/

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Claude 2026-04-11 19:44:30 +00:00
parent 526928dca8
commit 527731da53
2 changed files with 181 additions and 0 deletions

View file

@ -0,0 +1,123 @@
# Investigation: Reviewer approved destructive compose rewrite in PR #683
**Issue**: #685
**Date**: 2026-04-11
**PR under investigation**: #683 (fix: config: gardener=1h, architect=9m, planner=11m)
## Summary
The reviewer agent approved PR #683 in ~1 minute without flagging that it
contained a destructive rewrite of `docker-compose.yml` — dropping named
volumes, bind mounts, env vars, restart policy, and security options. Five
structural gaps in the review pipeline allowed this to pass.
## Root causes
### 1. No infrastructure-file-specific review checklist
The review formula (`formulas/review-pr.toml`) has a generic review checklist
(bugs, security, imports, architecture, bash specifics, dead code). It has
**no special handling for infrastructure files** — `docker-compose.yml`,
`Dockerfile`, CI configs, or `entrypoint.sh` are reviewed with the same
checklist as application code.
Infrastructure files have a different failure mode: a single dropped line
(a volume mount, an env var, a restart policy) can break a running deployment
without any syntax error or linting failure. The generic checklist doesn't
prompt the reviewer to check for these regressions.
**Fix applied**: Added step 3c "Infrastructure file review" to
`formulas/review-pr.toml` with a compose-specific checklist covering named
volumes, bind mounts, env vars, restart policy, and security options.
### 2. No scope discipline
Issue #682 asked for ~3 env var changes + `PLANNER_INTERVAL` plumbing — roughly
10-15 lines across 3-4 files. PR #683's diff rewrote the entire compose service
block (~50+ lines changed in `docker-compose.yml` alone).
The review formula **does not instruct the reviewer to compare diff size against
issue scope**. A scope-aware reviewer would flag: "this PR changes more lines
than the issue scope warrants — request justification for out-of-scope changes."
**Fix applied**: Added step 3d "Scope discipline" to `formulas/review-pr.toml`
requiring the reviewer to compare actual changes against stated issue scope and
flag out-of-scope modifications to infrastructure files.
### 3. Lessons-learned bias toward approval
The reviewer's `.profile/knowledge/lessons-learned.md` contains multiple entries
that systematically bias toward approval:
- "Approval means 'ready to ship,' not 'perfect.'"
- "'Different from how I'd write it' is not a blocker."
- "Reserve request_changes for genuinely blocking concerns."
These lessons are well-intentioned (they prevent nit-picking and false blocks)
but they create a blind spot: the reviewer suppresses its instinct to flag
suspicious-looking changes because the lessons tell it not to block on
"taste-based" concerns. A compose service block rewrite *looks* like a style
preference ("the dev reorganized the file") but is actually a correctness
regression.
**Recommendation**: The lessons-learned are not wrong — they should stay. But
the review formula now explicitly carves out infrastructure files from the
"bias toward APPROVE" guidance, making it clear that dropped infra
configuration is a blocking concern, not a style preference.
### 4. No ground-truth for infrastructure files
The reviewer only sees the diff. It has no way to compare against the running
container's actual volume/env config. When dev-qwen rewrote a 30-line service
block from scratch, the reviewer saw a 30-line addition and a 30-line deletion
with no reference point.
**Recommendation (future work)**: Maintain a `docker/expected-compose-config.yml`
or have the reviewer fetch `docker compose config` output as ground truth when
reviewing compose changes. This would let the reviewer diff the proposed config
against the known-good config.
### 5. Structural analysis blind spot
`lib/build-graph.py` tracks changes to files in `formulas/`, agent directories
(`dev/`, `review/`, etc.), and `evidence/`. It does **not track infrastructure
files** (`docker-compose.yml`, `docker/`, `.woodpecker/`). Changes to these
files produce no alerts in the graph report — the reviewer gets no
"affected objectives" signal for infrastructure changes.
**Recommendation (future work)**: Add infrastructure file tracking to
`build-graph.py` so that compose/Dockerfile/CI changes surface in the
structural analysis.
### 6. Model and time budget
Reviews use Sonnet (`CLAUDE_MODEL="sonnet"` at `review-pr.sh:229`) with a
15-minute timeout. The PR #683 review completed in ~1 minute. Sonnet is
optimized for speed, which is appropriate for most code reviews, but
infrastructure changes benefit from the deeper reasoning of a more capable
model.
**Recommendation (future work)**: Consider escalating to a more capable model
when the diff includes infrastructure files (compose, Dockerfiles, CI configs).
## Changes made
1. **`formulas/review-pr.toml`** — Added two new review steps:
- **Step 3c: Infrastructure file review** — When the diff touches
`docker-compose.yml`, `Dockerfile*`, `.woodpecker/`, or `docker/`,
requires checking for dropped volumes, bind mounts, env vars, restart
policy, security options, and network config. Instructs the reviewer to
read the full file (not just the diff) and compare against the base branch.
- **Step 3d: Scope discipline** — Requires comparing the actual diff
footprint against the stated issue scope. Flags out-of-scope rewrites of
infrastructure files as blocking concerns.
## What would have caught this
With the changes above, the reviewer would have:
1. Seen step 3c trigger for `docker-compose.yml` changes
2. Read the full compose file and compared against the base branch
3. Noticed the dropped named volumes, bind mounts, env vars, restart policy
4. Seen step 3d flag that a 3-env-var issue produced a 50+ line compose rewrite
5. Issued REQUEST_CHANGES citing specific dropped configuration

View file

@ -80,6 +80,64 @@ For each BEHAVIORAL change in the diff (not pure bug fixes or formatting):
This check is SKIPPED for pure bug fixes where the intended behavior is
unchanged (the code was wrong, not the documentation).
## 3c. Infrastructure file review (conditional)
If the diff touches ANY of these files, apply this additional checklist:
- `docker-compose.yml` or `docker-compose.*.yml`
- `Dockerfile` or `docker/*`
- `.woodpecker/` CI configs
- `docker/agents/entrypoint.sh`
Infrastructure files have a different failure mode from application code:
a single dropped line (a volume mount, an env var, a restart policy) can
break a running deployment with no syntax error. Treat dropped
infrastructure configuration as a **blocking defect**, not a style choice.
### For docker-compose.yml changes:
1. **Read the full file** in the PR branch do not rely only on the diff.
2. Run `git diff <base>..HEAD -- docker-compose.yml` to see the complete
change, not just the truncated diff.
3. Check that NONE of the following were dropped without explicit
justification in the PR description:
- Named volumes (e.g. `agent-data`, `project-repos`)
- Bind mounts (especially for config, secrets, SSH keys, shared dirs)
- Environment variables (compare the full `environment:` block against
the base branch)
- `restart:` policy (should be `unless-stopped` for production services)
- `security_opt:` settings
- Network configuration
- Resource limits / deploy constraints
4. If ANY production configuration was dropped and the PR description does
not explain why, **REQUEST_CHANGES**. List each dropped item explicitly.
### For Dockerfile / entrypoint changes:
1. Check that base image, installed packages, and runtime deps are preserved.
2. Verify that entrypoint/CMD changes don't break the container startup.
### For CI config changes:
1. Check that pipeline steps aren't silently removed.
2. Verify that secret references still match available secrets.
## 3d. Scope discipline
Compare the actual diff footprint against the stated issue scope:
1. Read the PR title and description to identify what the issue asked for.
2. Estimate the expected diff size (e.g., "add 3 env vars" = ~5-10 lines
in compose + ~5 lines in scripts).
3. If the actual diff in ANY single file exceeds 3x the expected scope,
flag it: "this file changed N lines but the issue scope suggests ~M."
For infrastructure files (compose, Dockerfiles, CI), scope violations are
**blocking**: REQUEST_CHANGES and ask the author to split out-of-scope
changes into a separate PR or justify them in the description.
For non-infrastructure files, scope violations are advisory: leave a
non-blocking COMMENT noting the scope creep.
## 4. Vault item quality (conditional)
If the PR adds or modifies vault item files (`vault/pending/*.md` in the ops repo), apply these