Who Watches Coding Agents? - More Coding Agents!
My current setup mostly uses Claude Code to write code, which I then review line by line before it gets committed. A while ago I got curious and had Codex review it first. This was a huge success: Codex found a lot of issues before I ever took a look. Time saved for me! If I am honest, some of the issues were ones I am not sure I would have found at all. Since this was useful, I made it my standard practice in my BMAD scripts. I also started having Codex review the spec before coding even started. Again, lots of good findings. Time and tokens very well spent.
So I wondered: if I get a lot of value out of Codex doing a review, would I get even more value from additional reviewers? I set up OpenHands with Kimi 2.5. Indeed: more issues found. Unfortunately, OpenHands and Kimi worked quite badly together, with frequent failures to run. Switching to Pi solved this. Curiosity took over, and I quickly found myself running half a dozen models against every spec and every commit. Partly because each added model did seem to find new marginal issues, and partly “for the science”. I hope to share some of that “science” here.
Caveat: The data is not as good and consistent as I wish it was. There are several reasons for this, but the most important factors were OpenRouter sometimes being out of quota for some models (and me not adding custom keys for all of them), and me being too curious about the rapid succession of new model releases we had recently. As a result, not all runs used the full set of reviewer models. In general I wish there was far more data, but I also want to share what I’ve found so far, especially since I know I’ll want to jump to other, newer models again once they come out.
The most important learning from this exercise is clear even from my limited, noisy data: adding more models as reviewers absolutely carries its weight. For anything but a CRUD feature, running ~6 models seems more than worthwhile. Your actual results will of course differ for your codebase and for next week’s models.
Models Run
All the models I ran in some shape or form were: Claude Opus 4, Claude Sonnet 4, GPT-5.x (Codex), GLM 5.1, DeepSeek V3.1 / V4 Pro / V4 Flash, Grok 4.3, Kimi 2.5 / K2-6, and Semgrep as a static analysis baseline. Note that the Claude Opus reviewer followed the BMAD Adversarial General process.
Results
Nearly Half of All Bugs Were Found by a Single Reviewer
Out of 285 confirmed real bugs across 17 PRs, 45% were found by exactly one reviewer, and another 31% by only two. No single model found even half of the bugs on its own: a clear argument for always running multiple reviewers.
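For anyone who wants to tally this from their own review logs, here is a minimal sketch. The `findings` mapping (bug ID to the set of reviewers that reported it) is a hypothetical log format, not my actual data:

```python
from collections import Counter

# Hypothetical log: each confirmed bug mapped to the reviewers that reported it.
findings = {
    "PR12-null-deref": {"opus", "codex"},
    "PR12-race-on-cache": {"glm"},
    "PR13-off-by-one": {"codex", "sonnet_pi", "deepseek"},
    # ... one entry per confirmed bug
}

# How many bugs were found by exactly 1, 2, 3, ... reviewers?
overlap = Counter(len(reviewers) for reviewers in findings.values())
total = len(findings)
for n_reviewers, n_bugs in sorted(overlap.items()):
    print(f"found by {n_reviewers} reviewer(s): {n_bugs} ({n_bugs / total:.0%})")
```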
More Reviewers, More Coverage — With Diminishing Returns
Using a greedy set cover to determine the best order in which to add reviewers, the first three reviewers cover 82% of all findings. The first two are both subscription-based, so effectively $0/run, and already cover 72%. GLM 5.1 via OpenRouter at ~$0.40/run pushes that to 82%. After three, each new reviewer still helps, but the gains flatten. (A sketch of the greedy selection follows below the table.)
| # | Add Reviewer | New Finds | Cumulative Coverage |
|---|---|---|---|
| 1 | Claude Opus | +133 | 46.7% |
| 2 | Codex (GPT-5.x) | +72 | 71.9% |
| 3 | GLM 5.1 | +29 | 82.1% |
| 4 | Sonnet (via Pi) | +18 | 88.4% |
| 5 | Sonnet (subagent) | +18 | 94.7% |
| 6 | DeepSeek V4 Pro | +6 | 96.8% |
Sonnet appears twice because it was run through two different harnesses: once via Pi, and once as a direct Claude subagent. Unfortunately, I only ever ran these on non-overlapping sets of PRs. It is still interesting that, despite being the same model, the Pi-mediated runs had noticeably higher precision (93% vs 70%), which suggests the agent harness and prompting may matter as much as the model itself. The sample sizes are small though (10 and 7 runs respectively), so more data is needed.
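The addition order in the table comes from a plain greedy set cover: at each step, add the reviewer whose confirmed findings cover the most bugs not yet covered. A minimal sketch, with an illustrative input rather than my real findings:

```python
# Greedy set cover: repeatedly pick the reviewer that adds the most new bugs.
def greedy_reviewer_order(findings_by_reviewer: dict[str, set[str]]) -> list[tuple[str, int]]:
    covered: set[str] = set()
    remaining = dict(findings_by_reviewer)
    order = []
    while remaining:
        # Reviewer whose findings add the most bugs not yet covered.
        best = max(remaining, key=lambda r: len(remaining[r] - covered))
        gain = len(remaining[best] - covered)
        if gain == 0:
            break
        covered |= remaining.pop(best)
        order.append((best, gain))
    return order

# Illustrative input: reviewer -> set of confirmed bug IDs it reported.
example = {
    "opus": {"b1", "b2", "b3"},
    "codex": {"b2", "b4"},
    "glm": {"b4", "b5"},
}
print(greedy_reviewer_order(example))  # [('opus', 3), ('codex', 1), ('glm', 1)]
```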
It’s Cheap
Three of the five recommended reviewers run on subscriptions I’m already paying for, so they cost $0 marginal per review. The remaining reviewers are easy to justify as well: even if GLM or DeepSeek only find a single bug per run, one escaped bug per week that they catch would more than cover the cost. It’s of course unclear how many of these bugs would have evaded my own review, but why risk that for such a small amount of money? (A quick back-of-the-envelope follows the table.)
| Reviewer | Cost Model | ≈ Cost per Run |
|---|---|---|
| Claude Opus | Subscription | ~$0 |
| Codex | Subscription | ~$0 |
| GLM 5.1 | OpenRouter | ~$0.40 |
| Sonnet (subagent) | Subscription | ~$0 |
| DeepSeek V4 Pro | OpenRouter | ~$0.26 |
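To make the “small amount of money” concrete, here is a rough back-of-the-envelope. The 20 PRs per week is an assumption for illustration, not my measured volume:

```python
# Back-of-the-envelope marginal cost of the paid reviewers.
cost_per_run = {"glm_5_1": 0.40, "deepseek_v4_pro": 0.26}  # subscription reviewers are ~$0
prs_per_week = 20  # assumption, adjust to your own volume

per_pr = sum(cost_per_run.values())
print(f"~${per_pr:.2f} per PR, ~${per_pr * prs_per_week:.2f} per week")
# ~$0.66 per PR, ~$13.20 per week
```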
Precision Varies Wildly
Not all output is signal. Precision, the share of findings that are actually real, varies a lot between reviewers: DeepSeek V4 Pro sits at 68%, while Sonnet via Pi hits 93%. A low-precision reviewer wastes the time you spend reading and dismissing noise. However, Claude was really good at filtering out most of the noise proactively.
| Reviewer | Precision | Real Finds | False Positives |
|---|---|---|---|
| Sonnet (via Pi) | 93% | 63 | 5 |
| Codex | 87% | 117 | 17 |
| Opus | 82% | 133 | 29 |
| GLM 5.1 | 73% | 74 | 27 |
| Sonnet (subagent) | 70% | 58 | 25 |
| DeepSeek V4 Pro | 68% | 36 | 17 |
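For clarity, precision here is just real findings divided by all findings a reviewer flagged. Recomputing the table from its own raw counts:

```python
# Precision = confirmed real findings / all findings flagged by the reviewer.
counts = {  # (real finds, false positives), taken from the table above
    "sonnet_via_pi": (63, 5),
    "codex": (117, 17),
    "opus": (133, 29),
    "glm_5_1": (74, 27),
    "sonnet_subagent": (58, 25),
    "deepseek_v4_pro": (36, 17),
}
for reviewer, (real, fp) in counts.items():
    print(f"{reviewer}: {real / (real + fp):.0%}")
```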
On Claude’s filtering, I ran an experiment that is too early to share real data on but seems promising so far: instead of just having Claude ask Codex for a review, I now have Claude request the review and then discuss the results with Codex. A few times this has produced a shared proposal that differed from what either model had proposed originally and that, when reported to me, I also preferred. I wish I had written down an example, but there is always another blog post.
Next Steps
For my own work, I’ll run Codex and Sonnet as reviewers for every piece of work, regardless of how simple it is. For more complex or critical work, I’ll also add GLM and DeepSeek V4 Pro. And of course I’ll keep cycling through new models to assess what should make up the reviewer pool.
In terms of data collection, I’ll continue logging results to inform which models I run going forward. I’d recommend that anyone considering a larger reviewer pool also log their data: which models bring value likely depends very much on the stack and project.
I also want to look at fully automating this entire process. Ideally, Claude just logs run results, picks models based on them, fetches news about new models to try, and does all of that independently; ideally I only hear about it in our weekly BMAD retro. The only issue I can foresee is that I might want to be involved in decisions to run more expensive models.
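To make the intent concrete, here is a rough sketch of the selection step I have in mind. The log fields, thresholds, and budget flag are all hypothetical; nothing like this exists in my setup yet:

```python
# Hypothetical reviewer-selection step: pick reviewers from logged stats,
# but defer anything above the per-run budget to a human decision.
def pick_reviewers(stats, budget_per_run=1.00):
    picked, needs_approval = [], []
    for s in sorted(stats, key=lambda s: s["marginal_coverage"], reverse=True):
        if s["precision"] < 0.6:                # too noisy to be worth reading
            continue
        if s["cost_per_run"] > budget_per_run:  # ask me before spending this
            needs_approval.append(s["name"])
        else:
            picked.append(s["name"])
    return picked, needs_approval

stats = [
    {"name": "codex", "precision": 0.87, "marginal_coverage": 0.25, "cost_per_run": 0.0},
    {"name": "glm_5_1", "precision": 0.73, "marginal_coverage": 0.10, "cost_per_run": 0.40},
    {"name": "shiny_new_model", "precision": 0.90, "marginal_coverage": 0.08, "cost_per_run": 3.50},
]
print(pick_reviewers(stats))  # (['codex', 'glm_5_1'], ['shiny_new_model'])
```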
