AI CODE REVIEW EVALUATION (2025)

An evaluation of 5 AI code review tools across 50 real-world bugs from production codebases. See which tools actually catch the issues that matter.

Overview

We compare 5 AI code review tools on 50 real-world pull requests to surface practical differences in how they catch bugs, manage signal versus noise, support multiple languages, and impact review quality.
Each tool was evaluated with default settings (no custom rules or fine-tuning). We measured bug-catch rates, comment quality, noise levels, time to review, and setup experience to reflect how these tools perform in everyday use.
All PRs come from public, verifiable repositories, so you can inspect the sources and reproduce the runs on your own. If you'd like the exact protocol, see the Methodology section.
Bug Detection by Severity Level
              Greptile   Cursor   Copilot   CodeRabbit   Graphite
Critical      58%        58%      50%       33%          17%
High          100%       64%      57%       36%          0%
Medium        89%        56%      78%       56%          11%
Low           87%        53%      87%       53%          0%

Methodology

The dataset covers 5 open-source GitHub repositories in different languages. From each, 10 real bug-fix PRs were traced back to the commits that introduced the bugs. Extremely large or single-file changes were excluded to keep the set realistic.
For each case, two branches were created: one before the bug and one after the fix. A fresh PR reintroduced the original change and was replicated across 5 clean forks, one per code review tool. Each tool had full repository access, including the PR diff and base branches.
All tools ran in their hosted cloud plans with default settings (no custom rules), and reviews were triggered by opening the PR or invoking the bot. A bug counted as "caught" only when the tool explicitly identified the faulty code in a line-level comment and explained the impact. All results were verified against the known bug.
Note that this evaluation was conducted in July 2025, and these tools evolve quickly, so performance may change over time. Scoring considered only detection of the original bug; false positives, style suggestions, and unrelated comments did not affect the catch rate.
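For concreteness, here is a minimal sketch of how one such case can be staged with git, assuming the bug-introducing commit has already been identified. The branch names and the prepare_case helper are illustrative stand-ins, not the published harness:

    import subprocess

    def run(*cmd, cwd):
        subprocess.run(cmd, cwd=cwd, check=True)

    def prepare_case(repo_dir: str, bug_commit: str, case_id: str) -> None:
        # Branch at the parent of the bug-introducing commit: the "before" state.
        run("git", "branch", f"eval/{case_id}-base", f"{bug_commit}^", cwd=repo_dir)
        # Branch at the bug-introducing commit itself: reintroduces the change.
        run("git", "branch", f"eval/{case_id}-bug", bug_commit, cwd=repo_dir)
        run("git", "push", "origin",
            f"eval/{case_id}-base", f"eval/{case_id}-bug", cwd=repo_dir)
        # A PR from eval/<id>-bug into eval/<id>-base, opened on each of the
        # five clean forks, is what triggers the tool under test, e.g.:
        #   gh pr create --base eval/<id>-base --head eval/<id>-bug --title "..."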
Test Dataset
Language     Repository   Description
Python       Sentry       Error tracking & performance monitoring
TypeScript   Cal.com      Open source scheduling infrastructure
Go           Grafana      Monitoring & observability platform
Java         Keycloak     Identity & access management
Ruby         Discourse    Community discussion platform

Bug Catch Performance

Greptile led with an 82% catch rate, 24 percentage points (about 41% in relative terms) above the runner-up, Cursor (58%). The rest separate cleanly: Copilot at 54%, CodeRabbit at 44%, and Graphite at 6%.
Overall Performance
Tool         Overall catch rate
Greptile     82%
Cursor       58%
Copilot      54%
CodeRabbit   44%
Graphite     6%

Catch rate by severity:

Tool         Critical   High    Medium + Low
Greptile     58%        100%    88%
Cursor       58%        64%     58%
Copilot      50%        57%     55%
CodeRabbit   33%        36%     55%
Graphite     17%        0%      6%

Case Library

Performance varies by repository and language. The tables list every PR in the test set with a one-line bug summary, severity, and whether each tool caught it. Tool names link to the tool's run, and each ✓/✗ links to the exact PR so you can review comments, summaries, and outputs.
The right choice depends on priorities. Some tools produced richer summaries, some were faster, and some were quieter. Use the tables to inspect cases that match your stack and tolerance for noise.
Caught = an explicit line-level PR comment that points to the faulty code and explains the impact. Summary-only mentions do not count.
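In code, that rule looks roughly like the check below. ReviewComment, bug_lines, and the keyword match are illustrative stand-ins; per the Methodology, the final verification against the known bug was done manually:

    from dataclasses import dataclass

    @dataclass
    class ReviewComment:
        path: str   # file the comment is anchored to
        line: int   # line-level anchor; summary comments have none
        body: str

    def is_catch(c: ReviewComment, bug_path: str, bug_lines: set[int],
                 bug_keywords: list[str]) -> bool:
        """Line-level comment on the faulty code that explains the known bug.

        Keyword matching stands in for manual verification; summary-only
        mentions fail the line-anchor check and never count.
        """
        anchored = c.path == bug_path and c.line in bug_lines
        return anchored and any(k in c.body.lower() for k in bug_keywords)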

Enhanced Pagination Performance for High-Volume Audit Logs
  Bug: Importing non-existent OptimizedCursorPaginator
  Severity: High

Optimize spans buffer insertion with eviction during insert
  Bug: Negative offset cursor manipulation bypasses pagination boundaries
  Severity: Critical

Support upsampled error count with performance optimizations
  Bug: sample_rate = 0.0 is falsy and skipped (see the first sketch after this list)
  Severity: Low

GitHub OAuth Security Enhancement
  Bug: Null reference if github_authenticated_user state is missing
  Severity: Critical

Replays Self-Serve Bulk Delete System
  Bug: Breaking changes in error response format
  Severity: Critical

Span Buffer Multiprocess Enhancement with Health Monitoring
  Bug: Inconsistent metric tagging with 'shard' and 'shards'
  Severity: Medium

Implement cross-system issue synchronization
  Bug: Shared mutable default in dataclass timestamp (see the second sketch after this list)
  Severity: Medium

Reorganize incident creation / issue occurrence logic
  Bug: Using stale config variable instead of updated one
  Severity: High

Add ability to use queues to manage parallelism
  Bug: Invalid queue.ShutDown exception handling (see the third sketch after this list)
  Severity: High

Add hook for producing occurrences from the stateful detector
  Bug: Incomplete implementation (only contains pass)
  Severity: High
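
A few of the patterns above are easy to reproduce in isolation. First, the sample_rate case: 0.0 is falsy in Python, so a plain truthiness check treats an explicit 0% sample rate the same as "unset" and silently substitutes the default. A generic reconstruction of the pattern, not the code under test:

    DEFAULT_RATE = 1.0

    def effective_rate(sample_rate):
        # BUG: 0.0 is falsy, so an explicit 0% sample rate is skipped
        # and silently replaced by the default.
        return sample_rate if sample_rate else DEFAULT_RATE

    def effective_rate_fixed(sample_rate):
        # Distinguish "unset" (None) from a legitimate zero.
        return sample_rate if sample_rate is not None else DEFAULT_RATE

    assert effective_rate(0.0) == 1.0        # wrong: zero ignored
    assert effective_rate_fixed(0.0) == 0.0  # correct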
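
Second, the shared-default dataclass timestamp. The PR summary doesn't show the exact field, but the general pattern is that a default expression on a dataclass field is evaluated once, at class definition time, so every instance shares the same stale value; field(default_factory=...) evaluates per instance instead. A hypothetical reconstruction:

    from dataclasses import dataclass, field
    from datetime import datetime, timezone

    @dataclass
    class SyncEventBuggy:
        # BUG: evaluated once when the class is defined; every instance
        # created afterwards shares this single stale timestamp.
        created_at: datetime = datetime.now(timezone.utc)

    @dataclass
    class SyncEvent:
        # default_factory runs on each instantiation: fresh timestamp.
        created_at: datetime = field(
            default_factory=lambda: datetime.now(timezone.utc))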
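
Third, the queue.ShutDown case. queue.ShutDown exists only on Python 3.13+, where Queue.shutdown() makes blocked get()/put() calls raise it; handling it in the wrong place, or referencing it on an older interpreter where the attribute is missing, breaks the worker loop. A sketch of a clean consumer, assuming Python 3.13 (the print call stands in for real per-item work; the original PR's exact mistake isn't shown here):

    import queue
    import threading

    def worker(q):
        while True:
            try:
                item = q.get()          # raises queue.ShutDown once the
            except queue.ShutDown:      # queue is shut down and drained
                return                  # normal termination, not an error
            print("processed", item)    # stand-in for real per-item work
            q.task_done()

    q = queue.Queue()
    t = threading.Thread(target=worker, args=(q,))
    t.start()
    for i in range(3):
        q.put(i)
    q.shutdown()   # Python 3.13+; pending get() calls unblock with ShutDown
    t.join()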