Who Watches Coding Agents? - More Coding Agents!
My current setup mostly uses Claude Code to write code, which I then review line by line before it gets committed. A while ago I got curious and had Codex review it first. This was a huge success: Codex found a lot of issues before I ever took a look. Time saved for me! If I am honest, some of the issues were ones I am not sure I would have found at all. Since this was useful, I made it my standard practice in my BMAD scripts. I also started having Codex review the spec before coding even started. Again, lots of good findings. Time and tokens very well spent.
So I wondered: if I get a lot of value out of Codex doing a review, would I get even more value from additional reviewers? I set up OpenHands with Kimi 2.5. Indeed: more issues found. Unfortunately, OpenHands and Kimi worked quite badly together, with frequent failures to run. Switching to Pi solved this. Curiosity took over, and I quickly found myself running half a dozen models against every spec and every commit. Partly because each added model did seem to find new marginal issues, and partly “for the science”. I hope to share some of that “science” here.
Caveat: The data is not as good and consistent as I wish it was. There are several reasons for this, but the most important factors were OpenRouter sometimes being out of quota for some models (and me not adding custom keys for all of them), and me being too curious about the rapid succession of new model releases we had recently. As a result, not all runs used the full set of reviewer models. In general I wish there was far more data, but I also want to share what I’ve found so far, especially since I know I’ll want to jump to other, newer models again once they come out.
The most important learning from this exercise is clear even from my limited, noisy data: adding more models as reviewers absolutely carries its weight. For anything but a CRUD feature, running ~6 models seems more than worthwhile. Your actual results will of course differ for your codebase and for next week’s models.
Models Run
All the models I ran in some shape or form were: Claude Opus 4, Claude Sonnet 4, GPT-5.x (Codex), GLM 5.1, DeepSeek V3.1 / V4 Pro / V4 Flash, Grok 4.3, Kimi 2.5 / K2-6, and Semgrep as a static analysis baseline. Note that the Claude Opus reviewer followed the BMAD Adversarial General process.
Results
Nearly Half of All Bugs Were Found by a Single Reviewer
Out of 285 confirmed real bugs across 17 PRs, 45% were found by exactly one reviewer, and another 31% by only two. No single model found even half of the bugs on its own: a clear argument for always running multiple reviewers.
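For anyone who wants to tally this from their own review logs, here is a minimal sketch. The `findings` mapping (bug ID to the set of reviewers that reported it) is a hypothetical log format, not my actual data:

```python
from collections import Counter

# Hypothetical log: each confirmed bug mapped to the reviewers that reported it.
findings = {
    "PR12-null-deref": {"opus", "codex"},
    "PR12-race-on-cache": {"glm"},
    "PR13-off-by-one": {"codex", "sonnet_pi", "deepseek"},
    # ... one entry per confirmed bug
}

# How many bugs were found by exactly 1, 2, 3, ... reviewers?
overlap = Counter(len(reviewers) for reviewers in findings.values())
total = len(findings)
for n_reviewers, n_bugs in sorted(overlap.items()):
    print(f"found by {n_reviewers} reviewer(s): {n_bugs} ({n_bugs / total:.0%})")
```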
More Reviewers, More Coverage — With Diminishing Returns
Using a greedy set cover to determine the best order in which to add reviewers, the first three reviewers cover 82% of all findings. The first two are both subscription-based, so effectively $0/run, and already cover 72%. GLM 5.1 via OpenRouter at ~$0.40/run pushes that to 82%. After three, each new reviewer still helps, but the gains flatten. (A sketch of the greedy selection follows below the table.)
| # | Add Reviewer | New Finds | Cumulative Coverage |
|---|---|---|---|
| 1 | Claude Opus | +133 | 46.7% |
| 2 | Codex (GPT-5.x) | +72 | 71.9% |
| 3 | GLM 5.1 | +29 | 82.1% |
| 4 | Sonnet (via Pi) | +18 | 88.4% |
| 5 | Sonnet (subagent) | +18 | 94.7% |
| 6 | DeepSeek V4 Pro | +6 | 96.8% |
Sonnet appears twice because it was run through two different harnesses: once via Pi, and once as a direct Claude subagent. Unfortunately, I only ever ran these on non-overlapping sets of PRs. It is still interesting that, despite being the same model, the Pi-mediated runs had noticeably higher precision (93% vs 70%), which suggests the agent harness and prompting may matter as much as the model itself. The sample sizes are small though (10 and 7 runs respectively), so more data is needed.
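The addition order in the table comes from a plain greedy set cover: at each step, add the reviewer whose confirmed findings cover the most bugs not yet covered. A minimal sketch, with an illustrative input rather than my real findings:

```python
# Greedy set cover: repeatedly pick the reviewer that adds the most new bugs.
def greedy_reviewer_order(findings_by_reviewer: dict[str, set[str]]) -> list[tuple[str, int]]:
    covered: set[str] = set()
    remaining = dict(findings_by_reviewer)
    order = []
    while remaining:
        # Reviewer whose findings add the most bugs not yet covered.
        best = max(remaining, key=lambda r: len(remaining[r] - covered))
        gain = len(remaining[best] - covered)
        if gain == 0:
            break
        covered |= remaining.pop(best)
        order.append((best, gain))
    return order

# Illustrative input: reviewer -> set of confirmed bug IDs it reported.
example = {
    "opus": {"b1", "b2", "b3"},
    "codex": {"b2", "b4"},
    "glm": {"b4", "b5"},
}
print(greedy_reviewer_order(example))  # [('opus', 3), ('codex', 1), ('glm', 1)]
```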
It’s Cheap
Three of the five recommended reviewers run on subscriptions I’m already paying for, so they cost $0 marginal per review. The remaining reviewers are easy to justify as well: even if GLM or DeepSeek only find a single bug per run, one escaped bug per week that they catch would more than cover the cost. It’s of course unclear how many of these bugs would have evaded my own review, but why risk that for such a small amount of money? (A quick back-of-the-envelope follows the table.)
| Reviewer | Cost Model | ≈ Cost per Run |
|---|---|---|
| Claude Opus | Subscription | ~$0 |
| Codex | Subscription | ~$0 |
| GLM 5.1 | OpenRouter | ~$0.40 |
| Sonnet (subagent) | Subscription | ~$0 |
| DeepSeek V4 Pro | OpenRouter | ~$0.26 |
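To make the “small amount of money” concrete, here is a rough back-of-the-envelope. The 20 PRs per week is an assumption for illustration, not my measured volume:

```python
# Back-of-the-envelope marginal cost of the paid reviewers.
cost_per_run = {"glm_5_1": 0.40, "deepseek_v4_pro": 0.26}  # subscription reviewers are ~$0
prs_per_week = 20  # assumption, adjust to your own volume

per_pr = sum(cost_per_run.values())
print(f"~${per_pr:.2f} per PR, ~${per_pr * prs_per_week:.2f} per week")
# ~$0.66 per PR, ~$13.20 per week
```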
Precision Varies Wildly
Not all output is signal. Precision, the share of findings that are actually real, varies a lot between reviewers: DeepSeek V4 Pro sits at 68%, while Sonnet via Pi hits 93%. A low-precision reviewer wastes the time you spend reading and dismissing noise. However, Claude was really good at filtering out most of the noise proactively.
| Reviewer | Precision | Real Finds | False Positives |
|---|---|---|---|
| Sonnet (via Pi) | 93% | 63 | 5 |
| Codex | 87% | 117 | 17 |
| Opus | 82% | 133 | 29 |
| GLM 5.1 | 73% | 74 | 27 |
| Sonnet (subagent) | 70% | 58 | 25 |
| DeepSeek V4 Pro | 68% | 36 | 17 |
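For clarity, precision here is just real findings divided by all findings a reviewer flagged. Recomputing the table from its own raw counts:

```python
# Precision = confirmed real findings / all findings flagged by the reviewer.
counts = {  # (real finds, false positives), taken from the table above
    "sonnet_via_pi": (63, 5),
    "codex": (117, 17),
    "opus": (133, 29),
    "glm_5_1": (74, 27),
    "sonnet_subagent": (58, 25),
    "deepseek_v4_pro": (36, 17),
}
for reviewer, (real, fp) in counts.items():
    print(f"{reviewer}: {real / (real + fp):.0%}")
```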
On Claude’s filtering, I ran an experiment that is too early to share real data on but seems promising so far: instead of just having Claude ask Codex for a review, I now have Claude request the review and then discuss the results with Codex. A few times this has produced a shared proposal that differed from what either model had proposed originally and that, when reported to me, I also preferred. I wish I had written down an example, but there is always another blog post.
Next Steps
For my own work, I’ll run Codex and Sonnet as reviewers for every piece of work, regardless of how simple it is. For more complex or critical work, I’ll also add GLM and DeepSeek V4 Pro. And of course I’ll keep cycling through new models to assess what should make up the reviewer pool.
In terms of data collection, I’ll continue logging results to inform which models I run going forward. I’d recommend that anyone considering a larger reviewer pool also log their data: which models bring value likely depends very much on the stack and project.
I also want to look at fully automating this entire process. Ideally, Claude just logs run results, picks models based on them, fetches news about new models to try, and does all of that independently; ideally I only hear about it in our weekly BMAD retro. The only issue I can foresee is that I might want to be involved in decisions to run more expensive models.
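To make the intent concrete, here is a rough sketch of the selection step I have in mind. The log fields, thresholds, and budget flag are all hypothetical; nothing like this exists in my setup yet:

```python
# Hypothetical reviewer-selection step: pick reviewers from logged stats,
# but defer anything above the per-run budget to a human decision.
def pick_reviewers(stats, budget_per_run=1.00):
    picked, needs_approval = [], []
    for s in sorted(stats, key=lambda s: s["marginal_coverage"], reverse=True):
        if s["precision"] < 0.6:                # too noisy to be worth reading
            continue
        if s["cost_per_run"] > budget_per_run:  # ask me before spending this
            needs_approval.append(s["name"])
        else:
            picked.append(s["name"])
    return picked, needs_approval

stats = [
    {"name": "codex", "precision": 0.87, "marginal_coverage": 0.25, "cost_per_run": 0.0},
    {"name": "glm_5_1", "precision": 0.73, "marginal_coverage": 0.10, "cost_per_run": 0.40},
    {"name": "shiny_new_model", "precision": 0.90, "marginal_coverage": 0.08, "cost_per_run": 3.50},
]
print(pick_reviewers(stats))  # (['codex', 'glm_5_1'], ['shiny_new_model'])
```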
