fix: investigation: reviewer agent approved destructive compose rewrite in PR #683 — why? (#685)

Root cause: review formula had no infrastructure-file-specific checklist and no scope discipline check. The reviewer treated a docker-compose.yml rewrite the same as any code change, and lessons-learned biased toward approval. Changes: - Add step 3c (infrastructure file review) to formulas/review-pr.toml: compose-specific checklist for volumes, bind mounts, env vars, restart policy, security options - Add step 3d (scope discipline) to formulas/review-pr.toml: compare actual diff size against issue scope, block on infra-file scope violations - Add investigation writeup in docs/ Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 19:44:30 +00:00 · 2026-04-11 19:44:30 +00:00 · 527731da53
commit 527731da53
parent 526928dca8
2 changed files with 181 additions and 0 deletions
--- a/docs/investigation-685-reviewer-approved-destructive-compose.md
+++ b/docs/investigation-685-reviewer-approved-destructive-compose.md
@ -0,0 +1,123 @@
+# Investigation: Reviewer approved destructive compose rewrite in PR #683
+
+**Issue**: #685
+**Date**: 2026-04-11
+**PR under investigation**: #683 (fix: config: gardener=1h, architect=9m, planner=11m)
+
+## Summary
+
+The reviewer agent approved PR #683 in ~1 minute without flagging that it
+contained a destructive rewrite of `docker-compose.yml` — dropping named
+volumes, bind mounts, env vars, restart policy, and security options. Five
+structural gaps in the review pipeline allowed this to pass.
+
+## Root causes
+
+### 1. No infrastructure-file-specific review checklist
+
+The review formula (`formulas/review-pr.toml`) has a generic review checklist
+(bugs, security, imports, architecture, bash specifics, dead code). It has
+**no special handling for infrastructure files** — `docker-compose.yml`,
+`Dockerfile`, CI configs, or `entrypoint.sh` are reviewed with the same
+checklist as application code.
+
+Infrastructure files have a different failure mode: a single dropped line
+(a volume mount, an env var, a restart policy) can break a running deployment
+without any syntax error or linting failure. The generic checklist doesn't
+prompt the reviewer to check for these regressions.
+
+**Fix applied**: Added step 3c "Infrastructure file review" to
+`formulas/review-pr.toml` with a compose-specific checklist covering named
+volumes, bind mounts, env vars, restart policy, and security options.
+
+### 2. No scope discipline
+
+Issue #682 asked for ~3 env var changes + `PLANNER_INTERVAL` plumbing — roughly
+10-15 lines across 3-4 files. PR #683's diff rewrote the entire compose service
+block (~50+ lines changed in `docker-compose.yml` alone).
+
+The review formula **does not instruct the reviewer to compare diff size against
+issue scope**. A scope-aware reviewer would flag: "this PR changes more lines
+than the issue scope warrants — request justification for out-of-scope changes."
+
+**Fix applied**: Added step 3d "Scope discipline" to `formulas/review-pr.toml`
+requiring the reviewer to compare actual changes against stated issue scope and
+flag out-of-scope modifications to infrastructure files.
+
+### 3. Lessons-learned bias toward approval
+
+The reviewer's `.profile/knowledge/lessons-learned.md` contains multiple entries
+that systematically bias toward approval:
+
+- "Approval means 'ready to ship,' not 'perfect.'"
+- "'Different from how I'd write it' is not a blocker."
+- "Reserve request_changes for genuinely blocking concerns."
+
+These lessons are well-intentioned (they prevent nit-picking and false blocks)
+but they create a blind spot: the reviewer suppresses its instinct to flag
+suspicious-looking changes because the lessons tell it not to block on
+"taste-based" concerns. A compose service block rewrite *looks* like a style
+preference ("the dev reorganized the file") but is actually a correctness
+regression.
+
+**Recommendation**: The lessons-learned are not wrong — they should stay. But
+the review formula now explicitly carves out infrastructure files from the
+"bias toward APPROVE" guidance, making it clear that dropped infra
+configuration is a blocking concern, not a style preference.
+
+### 4. No ground-truth for infrastructure files
+
+The reviewer only sees the diff. It has no way to compare against the running
+container's actual volume/env config. When dev-qwen rewrote a 30-line service
+block from scratch, the reviewer saw a 30-line addition and a 30-line deletion
+with no reference point.
+
+**Recommendation (future work)**: Maintain a `docker/expected-compose-config.yml`
+or have the reviewer fetch `docker compose config` output as ground truth when
+reviewing compose changes. This would let the reviewer diff the proposed config
+against the known-good config.
+
+### 5. Structural analysis blind spot
+
+`lib/build-graph.py` tracks changes to files in `formulas/`, agent directories
+(`dev/`, `review/`, etc.), and `evidence/`. It does **not track infrastructure
+files** (`docker-compose.yml`, `docker/`, `.woodpecker/`). Changes to these
+files produce no alerts in the graph report — the reviewer gets no
+"affected objectives" signal for infrastructure changes.
+
+**Recommendation (future work)**: Add infrastructure file tracking to
+`build-graph.py` so that compose/Dockerfile/CI changes surface in the
+structural analysis.
+
+### 6. Model and time budget
+
+Reviews use Sonnet (`CLAUDE_MODEL="sonnet"` at `review-pr.sh:229`) with a
+15-minute timeout. The PR #683 review completed in ~1 minute. Sonnet is
+optimized for speed, which is appropriate for most code reviews, but
+infrastructure changes benefit from the deeper reasoning of a more capable
+model.
+
+**Recommendation (future work)**: Consider escalating to a more capable model
+when the diff includes infrastructure files (compose, Dockerfiles, CI configs).
+
+## Changes made
+
+1. **`formulas/review-pr.toml`** — Added two new review steps:
+   - **Step 3c: Infrastructure file review** — When the diff touches
+     `docker-compose.yml`, `Dockerfile*`, `.woodpecker/`, or `docker/`,
+     requires checking for dropped volumes, bind mounts, env vars, restart
+     policy, security options, and network config. Instructs the reviewer to
+     read the full file (not just the diff) and compare against the base branch.
+   - **Step 3d: Scope discipline** — Requires comparing the actual diff
+     footprint against the stated issue scope. Flags out-of-scope rewrites of
+     infrastructure files as blocking concerns.
+
+## What would have caught this
+
+With the changes above, the reviewer would have:
+
+1. Seen step 3c trigger for `docker-compose.yml` changes
+2. Read the full compose file and compared against the base branch
+3. Noticed the dropped named volumes, bind mounts, env vars, restart policy
+4. Seen step 3d flag that a 3-env-var issue produced a 50+ line compose rewrite
+5. Issued REQUEST_CHANGES citing specific dropped configuration
--- a/formulas/review-pr.toml
+++ b/formulas/review-pr.toml
@ -80,6 +80,64 @@ For each BEHAVIORAL change in the diff (not pure bug fixes or formatting):
 This check is SKIPPED for pure bug fixes where the intended behavior is
 unchanged (the code was wrong, not the documentation).

+## 3c. Infrastructure file review (conditional)
+
+If the diff touches ANY of these files, apply this additional checklist:
+- `docker-compose.yml` or `docker-compose.*.yml`
+- `Dockerfile` or `docker/*`
+- `.woodpecker/` CI configs
+- `docker/agents/entrypoint.sh`
+
+Infrastructure files have a different failure mode from application code:
+a single dropped line (a volume mount, an env var, a restart policy) can
+break a running deployment with no syntax error. Treat dropped
+infrastructure configuration as a **blocking defect**, not a style choice.
+
+### For docker-compose.yml changes:
+
+1. **Read the full file** in the PR branch — do not rely only on the diff.
+2. Run `git diff <base>..HEAD -- docker-compose.yml` to see the complete
+   change, not just the truncated diff.
+3. Check that NONE of the following were dropped without explicit
+   justification in the PR description:
+   - Named volumes (e.g. `agent-data`, `project-repos`)
+   - Bind mounts (especially for config, secrets, SSH keys, shared dirs)
+   - Environment variables (compare the full `environment:` block against
+     the base branch)
+   - `restart:` policy (should be `unless-stopped` for production services)
+   - `security_opt:` settings
+   - Network configuration
+   - Resource limits / deploy constraints
+4. If ANY production configuration was dropped and the PR description does
+   not explain why, **REQUEST_CHANGES**. List each dropped item explicitly.
+
+### For Dockerfile / entrypoint changes:
+
+1. Check that base image, installed packages, and runtime deps are preserved.
+2. Verify that entrypoint/CMD changes don't break the container startup.
+
+### For CI config changes:
+
+1. Check that pipeline steps aren't silently removed.
+2. Verify that secret references still match available secrets.
+
+## 3d. Scope discipline
+
+Compare the actual diff footprint against the stated issue scope:
+
+1. Read the PR title and description to identify what the issue asked for.
+2. Estimate the expected diff size (e.g., "add 3 env vars" = ~5-10 lines
+   in compose + ~5 lines in scripts).
+3. If the actual diff in ANY single file exceeds 3x the expected scope,
+   flag it: "this file changed N lines but the issue scope suggests ~M."
+
+For infrastructure files (compose, Dockerfiles, CI), scope violations are
+**blocking**: REQUEST_CHANGES and ask the author to split out-of-scope
+changes into a separate PR or justify them in the description.
+
+For non-infrastructure files, scope violations are advisory: leave a
+non-blocking COMMENT noting the scope creep.
+
 ## 4. Vault item quality (conditional)

 If the PR adds or modifies vault item files (`vault/pending/*.md` in the ops repo), apply these