What Developers Need to Know About AI Code Reviews

Everett Butler • Jun 16, 2025

I work at Greptile, where we build an AI code review tool. So take what follows with that context. But this post isn't a product walkthrough. It's what we've learned from reviewing over 700,000 pull requests per month about how AI is changing the review process, and what developers should actually pay attention to.

Whether you use AI to write code or not, it's already in your pull requests. Your teammates are using Cursor, Copilot, or Claude Code. The volume of code hitting review has gone up. The nature of that code has changed. And the way you review it probably needs to change too.

Why AI-Generated Code Is Harder to Review

When a human writes a bug, it's usually because they were tired, rushed, or didn't know about an edge case. The code often looks a little off. A quick scan of the diff and something feels wrong.

AI-generated code doesn't work that way. It's syntactically clean, well-structured, and reads like it was written by someone who knows what they're doing. The bugs it introduces are different: a function that handles 95% of cases perfectly but silently drops the other 5%. A retry loop that wraps another retry loop, turning 5 attempts into 25. An authentication check that looks correct in the file it's in but doesn't match how every other endpoint in the codebase does it.

The failure mode isn't sloppiness. It's confidence. The code looks right, which means reviewers are more likely to skim it. And the bugs tend to be cross-file logic issues, the kind you only catch if you understand how the change interacts with the rest of the codebase, not just the diff in front of you.
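A hypothetical example (not drawn from the dataset) of the "handles 95% of cases, silently drops the rest" pattern: a pagination helper that reads cleanly and would pass a quick scan of the diff.

```python
def paginate(items, page_size=10):
    """Split items into fixed-size pages.

    Looks correct at a glance, but the floor division silently drops
    the final partial page whenever len(items) is not a multiple of
    page_size. No error, no warning; the tail items just vanish.
    """
    pages = []
    for i in range(len(items) // page_size):
        pages.append(items[i * page_size:(i + 1) * page_size])
    return pages

pages = paginate(list(range(25)))
print(len(pages))  # 2 pages; the last 5 items are silently lost
```

Nothing here is syntactically wrong, and a style checker has nothing to say about it. You only catch it by asking what happens at the boundary.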

This shows up in our data. Across 3.4 million completed reviews, nearly half of all flagged issues are logic errors, not syntax or style problems.

| Issue Type | % of Flagged Issues | What It Catches |
|---|---|---|
| Logic | 48% | Incorrect behavior, race conditions, wrong return values |
| Style | 42% | Dead code, naming inconsistencies, unnecessary complexity |
| Syntax | 10% | Missing imports, type errors, invalid syntax |

The bugs AI introduces aren't the kind a linter would catch. They're the kind that require reading the rest of the codebase to even recognize as bugs.

Why AI Code Reviewers Should Be Independent

This is the most important thing to internalize about AI code review: the tool that generates the code should not be the same tool that reviews it.

It sounds obvious when you say it out loud. We separated auditing from accounting decades ago for exactly this reason. But in practice, many teams are letting the same AI context that helped write a PR also validate it, by asking Copilot to "review this" or having their coding assistant self-check before pushing.

The problem is shared assumptions. An AI that generated code based on a certain interpretation of the codebase will review that code with the same interpretation. It won't catch its own blind spots. If it invented a plausible-looking API call that doesn't actually exist, it'll read that call during review and think it looks fine. Because it wrote it.

An independent reviewer starts from the codebase itself. It reads how things actually work, not how the generator assumed they work. That gap, between assumption and reality, is where the most dangerous bugs live.

The scale of the problem is real:

| Metric | Value |
|---|---|
| PRs reviewed | 2.2M |
| PRs with at least one issue flagged | 69.5% |
| Median comments per flagged PR | 2 |
| Comments rated helpful by developers | 58% |

Most PRs have something worth catching, and most of it gets missed without an independent set of eyes.

Confidence Scores: The Developer Tool for PR Triage

If every PR gets a wall of AI-generated comments, you'll ignore all of them within a week. The most useful thing an AI code review tool can do isn't just flag issues. It's tell you how much attention a PR actually needs.

This is what confidence scores do. Instead of treating every pull request as equal, a score tells you: this PR is clean, merge it. Or: this one has problems, spend your time here.

A config file rename doesn't need the same scrutiny as a payments refactor. A well-tested utility function doesn't need the same review as a new auth flow. Confidence scores encode that difference so you can allocate review time where it actually matters. For engineering leads managing a team's PR queue, this is the practical shift: scan scores across open PRs and you immediately know which ones need a human reviewer's full attention and which ones have already been thoroughly validated.

In our data, the difference in outcomes between high and low confidence PRs is stark:

| Confidence Score | % of PRs | Median Time to Merge | Action |
|---|---|---|---|
| 4-5 (safe) | 63% | 4.0 hrs | Merge immediately or after minor fixes |
| 1-3 (needs attention) | 37% | 14.3 hrs | Address feedback, rework, or rethink |

That's a 3.6x difference in merge speed, and it maps almost entirely to how much rework is needed. This isn't about skipping review. It's about spending review time on the right things.
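In practice, triaging a PR queue by score can be as simple as partitioning on a threshold. A minimal sketch, assuming each open PR is a record with a numeric `confidence` field (the field names and data are illustrative, not a real API):

```python
def triage(prs, threshold=4):
    """Split open PRs into a fast-track lane (score >= threshold)
    and a needs-attention lane that gets a human reviewer's full focus."""
    fast_track = [pr for pr in prs if pr["confidence"] >= threshold]
    needs_review = [pr for pr in prs if pr["confidence"] < threshold]
    return fast_track, needs_review

queue = [
    {"id": 101, "title": "Rename config key", "confidence": 5},
    {"id": 102, "title": "Refactor payments flow", "confidence": 2},
    {"id": 103, "title": "Add utility function", "confidence": 4},
]
fast, slow = triage(queue)
# fast -> PRs 101 and 103; slow -> PR 102, where review time should go
```

The point isn't the code; it's that a single number per PR turns a queue you scan linearly into a queue you can sort by risk.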

Real AI Code Review Catches in Production

The best way to evaluate any AI code review tool is to look at what it actually catches. Not style nits or formatting suggestions. Real bugs that made it through human review in production codebases.

Nested retry loop. A new method wrapped an HTTP request in a 5-iteration retry loop. But the method it called already retries 5 times internally. A single connection failure triggered 25 attempts with compounded exponential backoff. What should fail in 30 seconds could hang for minutes. The code read fine. You had to know the internals of the called method to see the problem. Netflix Metaflow - PR #2817
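The shape of that bug is easy to reproduce. A simplified sketch (not the actual Metaflow code) of a new method wrapping a helper that already retries internally:

```python
attempts = 0  # counter to make the compounding visible

def do_request(url):
    """Stand-in for an HTTP call against a down host."""
    raise ConnectionError

def fetch_with_retries(url, retries=5):
    """Existing helper: already retries 5 times internally."""
    global attempts
    for _ in range(retries):
        attempts += 1
        try:
            return do_request(url)
        except ConnectionError:
            pass  # backoff elided for brevity
    raise ConnectionError("all retries failed")

def new_method(url, retries=5):
    """New code wraps the helper in its own retry loop.
    One hard failure now triggers 5 x 5 = 25 attempts."""
    for _ in range(retries):
        try:
            return fetch_with_retries(url)
        except ConnectionError:
            pass
    raise ConnectionError("gave up")

try:
    new_method("https://example.com")
except ConnectionError:
    pass
print(attempts)  # 25 attempts for a single connection failure
```

Each function is correct in isolation; the bug only exists in their composition, which is exactly why it survives a diff-only review.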

Reasoning traces leak into user responses. Internal AI reasoning traces were bleeding into user-facing responses across conversation turns. In a safety-critical guardrails system, users were seeing raw chain-of-thought output that was never meant to leave the server. NVIDIA NeMo Guardrails - PR #1468

Three conflicting reward implementations. The README, the evaluator, and the environment each computed rewards differently. Dead code in one file was silently overridden by another. No single file looked wrong. You had to read all three to see they disagreed. Meta PyTorch - PR #308

None of these are obscure. They're the kind of bugs that experienced developers would catch if they had unlimited time and full context of every file in the codebase. Across nearly 5,000 open source repos, we've flagged over 590,000 issues like these. The value of an AI code review developer tool isn't that it's smarter. It's that it reads everything, every time, without fatigue.

What to Look For in an AI Code Review Tool

You don't need to overhaul how your team works. But a few things are worth adjusting.

Your review instincts need to update for AI-generated code. The impulse to skim clean-looking code is going to get more dangerous as more of it comes from AI. If a diff looks too clean, that might be a reason to look harder, not faster.

Independence matters more than intelligence. When evaluating AI code review tools, the most important question isn't how smart the model is. It's whether the reviewer has its own understanding of your codebase, separate from whatever wrote the code.

Signal control is non-negotiable. A reviewer that floods every PR with low-value comments will get turned off within a month. Severity controls, comment type filtering, and a system that learns from your team's feedback over time are table stakes. The goal is fewer, better comments, and getting there is harder than it sounds.

Confidence scores are a triage tool, not a rubber stamp. A high score means the reviewer found nothing concerning. It doesn't mean nothing is wrong. Use scores to decide where to invest your time, not to skip review entirely. When most PRs don't receive human comments, having a reliable signal for which ones should matters more than ever.

The volume of code being written isn't going back down. The proportion of it written by AI isn't either. The teams that adjust their review process for this reality will ship faster and catch more. The ones that don't will also ship faster. They just won't catch anything.
