AI CODE REVIEW BENCHMARKS
An evaluation of 5 AI code review tools across 50 real-world bugs from production codebases (2025). See which tools actually catch the issues that matter.
Overview
We compare 5 AI code review tools on 50 real-world pull requests to surface practical differences in how they catch bugs, manage signal versus noise, support multiple languages, and impact review quality.
Each tool was evaluated with default settings (no custom rules or fine-tuning). We measured bug-catch rates, comment quality, noise levels, time to review, and setup experience to reflect how these tools perform in everyday use.
All PRs come from public, verifiable repositories, so you can inspect the sources and reproduce the runs on your own. If you'd like the exact protocol, see the Methodology section.
Methodology
The dataset covers 5 open-source GitHub repositories in different languages. From each, 10 real bug-fix PRs were traced back to the commits that introduced the bugs. Extremely large PRs and trivial single-file changes were excluded to keep the set realistic.
For each case, two branches were created: one before the bug and one after the fix. A fresh PR reintroduced the original change and was replicated across 5 clean forks, one per code review tool. Each tool had full repository access, including the PR diff and base branches.
All tools ran in their hosted cloud plans with default settings (no custom rules), and reviews were triggered by opening the PR or invoking the bot. A bug counted as "caught" only when the tool explicitly identified the faulty code in a line-level comment and explained the impact. All results were verified against the known bug.
Note that this evaluation was conducted in July 2025, and these tools evolve quickly, so performance may change over time. Scoring considered only detection of the original bug; false positives, style suggestions, and unrelated comments did not affect the catch rate.
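The scoring rule above can be sketched as a small predicate. This is an illustrative model, not the study's actual harness: the class and field names (`Comment`, `KnownBug`, `path`, `line`) are hypothetical, and the "explains the impact" requirement was verified by humans, so only the line-matching half is mechanized here.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Comment:
    path: str   # file the review comment is anchored to
    line: int   # line the comment points at
    body: str   # comment text (impact explanation judged manually)

@dataclass(frozen=True)
class KnownBug:
    path: str
    lines: frozenset  # line numbers of the faulty code

def is_caught(bug: KnownBug, comments: list) -> bool:
    """A bug counts as 'caught' only if some line-level comment lands on
    the faulty code; summary-only mentions never qualify."""
    return any(c.path == bug.path and c.line in bug.lines for c in comments)

def catch_rate(caught_flags: list) -> float:
    """Fraction of cases caught. False positives, style suggestions, and
    unrelated comments do not enter this number either way."""
    return sum(caught_flags) / len(caught_flags)
```

Under this rule, a perfect summary that never pins the faulty line still scores zero for that case, which is why summary-only mentions are excluded in the tables below.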
Bug Catch Performance
Greptile led with an 82% catch rate, a 41% relative improvement over second-place Bugbot (58%). The rest of the field stacks clearly: Bugbot and Copilot in the mid-50s, CodeRabbit at 44%, and Graphite at 6%.
Case Library
Performance varies by repository and language. The tables list every PR in the test set with a one-line bug summary, severity, and whether each tool caught it. Tool names link to the tool's run, and each ✓/✗ links to the exact PR so you can review comments, summaries, and outputs.
The right choice depends on priorities. Some tools produced richer summaries, some were faster, and some were quieter. Use the tables to inspect cases that match your stack and tolerance for noise.
Caught = an explicit line-level PR comment that points to the faulty code and explains the impact. Summary-only mentions do not count.