
Is Claude Sonnet 3.7 Thinking Better Than Claude Sonnet 3.7 at Catching Bugs?

March 31, 2025

Written by Daksh Gupta

I am Daksh from Greptile. We're building an AI code review bot that catches bugs and anti-patterns in pull requests, and we rely heavily on LLMs to do it. Naturally, a key factor in the bot's success is how effectively the underlying model can find bugs when given all the necessary code and context.

Bug detection is a materially different problem from code generation. It is more difficult, and being good at one does not directly translate into being good at the other.

When reasoning models started to get good late last year with the release of OpenAI o3-mini, we became interested in whether they would outperform traditional LLMs at finding bugs in code.

Reasoning models are functionally similar to LLMs, except that before they begin generating their response, they have a planning or thinking step. They spend some tokens solidifying the user's intent and reasoning about the steps required, and only then start generating the actual response.

The problem was that o3-mini isn't simply a reasoning/thinking version of another OpenAI LLM; for all we know, it is a larger model entirely. As a result, there was no clean comparison to be made.

Recently, Anthropic unveiled Claude Sonnet 3.7, the latest iteration of the Sonnet series, whose previous iteration (3.5) has been the preferred codegen model for months. Along with 3.7 they released 3.7 Thinking, the same model, but with reasoning capabilities. With this, we can do a fair evaluation.
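Both configurations hit the same underlying model; the only difference is whether an extended-thinking budget is enabled. Our actual evaluation harness isn't shown in this post, but as a rough sketch of what such an A/B looks like, here is a minimal Ruby example against Anthropic's public Messages API. The model ID and the `thinking` parameter reflect Anthropic's published docs at the time of writing and should be treated as assumptions, not a description of our setup.

```ruby
require "net/http"
require "uri"
require "json"

# Minimal sketch (not our production harness): send the same bug-finding
# prompt to Claude Sonnet 3.7, optionally enabling extended thinking.
def ask_claude(prompt, thinking: false)
  uri = URI("https://api.anthropic.com/v1/messages")

  body = {
    model: "claude-3-7-sonnet-20250219",   # assumed public model ID
    max_tokens: 8192,
    messages: [{ role: "user", content: prompt }]
  }
  # Same model either way; only the thinking budget is toggled.
  body[:thinking] = { type: "enabled", budget_tokens: 4096 } if thinking

  request = Net::HTTP::Post.new(uri)
  request["x-api-key"] = ENV.fetch("ANTHROPIC_API_KEY")
  request["anthropic-version"] = "2023-06-01"
  request["content-type"] = "application/json"
  request.body = JSON.generate(body)

  response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) do |http|
    http.request(request)
  end
  JSON.parse(response.body)
end
```

Running `ask_claude(review_prompt)` and `ask_claude(review_prompt, thinking: true)` on the same code and context is what makes the comparison apples-to-apples.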

The Evaluation Dataset

I wanted the dataset of bugs to cover multiple domains and languages. I picked sixteen domains, chose 2-3 self-contained programs for each, and used Cursor to generate each program in TypeScript, Ruby, Python, Go, and Rust. The full list of programs is below:

  1. distributed microservices platform
  2. event-driven simulation engine
  3. containerized development environment manager
  4. natural language processing toolkit
  5. predictive anomaly detection system
  6. decentralized voting platform
  7. smart contract development framework
  8. custom peer-to-peer network protocol
  9. real-time collaboration platform
  10. progressive web app framework
  11. webassembly compiler and runtime
  12. serverless orchestration platform
  13. procedural world generation engine
  14. ai-powered game testing framework
  15. multiplayer game networking engine
  16. big data processing framework
  17. real-time data visualization platform
  18. machine learning model monitoring system
  19. advanced encryption toolkit
  20. penetration testing automation framework
  21. iot device management platform
  22. edge computing framework
  23. smart home automation system
  24. quantum computing simulation environment
  25. bioinformatics analysis toolkit
  26. climate modeling and simulation platform
  27. advanced code generation ai
  28. automated code refactoring tool
  29. comprehensive developer productivity suite
  30. algorithmic trading platform
  31. blockchain-based supply chain tracker
  32. personal finance management ai
  33. advanced audio processing library
  34. immersive virtual reality development framework
  35. serverless computing optimizer
  36. distributed machine learning training framework
  37. robotic process automation rpa platform
  38. adaptive learning management system
  39. interactive coding education platform
  40. language learning ai tutor
  41. comprehensive personal assistant framework
  42. multiplayer collaboration platform

Next, I cycled through the programs and introduced a tiny bug in each one. Each bug had to be:

  1. A bug that a professional developer could reasonably introduce
  2. A bug that could easily slip through linters, tests, and manual code review

Some examples of bugs I introduced:

  1. Undefined `response` variable in the ensure block (a sketch of this pattern follows the list)
  2. Not accounting for amplitude normalization when computing wave stretching on a sound sample
  3. Hard coded date which would be accurate in most, but not all situations
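To make the first example concrete, here is a hypothetical Ruby sketch of that class of bug. The method, URL handling, and logging line are invented for illustration; the point is that if the HTTP call raises, `response` never gets a real value, so the `ensure` block itself blows up and masks the original exception.

```ruby
require "net/http"
require "uri"
require "json"

# Hypothetical illustration (not the benchmark code): `response` is only
# assigned if the HTTP call succeeds. If the call raises, Ruby still treats
# `response` as a declared local holding nil, so `response.code` in the
# ensure block raises NoMethodError and hides the original error.
def fetch_settings(url)
  response = Net::HTTP.get_response(URI(url))
  JSON.parse(response.body)
ensure
  puts "GET #{url} -> #{response.code}"  # BUG: response may still be nil here
end
```

A safer version would guard the log line (for example with `response&.code`) or log inside the happy path instead. It is exactly the kind of slip that works fine on the happy path and only surfaces when an error occurs.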

At the end of this, I had 210 programs, each with a small, difficult-to-catch, and realistic bug.

A disclaimer: these bugs are the hardest-to-catch bugs I could think of, and are not representative of the median bugs usually found in everyday software.

Results

Of the 210 bugs, Claude Sonnet 3.7 correctly discovered 17 (about 8%); the thinking version discovered 23 (about 11%).

[Bar chart: bug detection rates across all languages for Claude Sonnet 3.7 vs. Claude Sonnet 3.7 Thinking]

The small numbers are unsurprising to me. These were difficult bugs, and part of why I am excited about the future of AI in software verification is that we are at the earliest stages of foundation-model capability in this realm, yet AI code reviewers are already useful. One can only imagine how useful they will be a couple of years from now.

Thinking is a bigger improvement for certain languages

Something that surprised me was that the breakdown showed meaningful variance across languages.

[Bar chart: bug detection rates broken down by programming language]

The thinking model performed no better than the non-thinking model at catching bugs in TypeScript and Python, caught a few more in Go and Rust, and performed much better in Ruby.

A theory on why this happens: LLMs are trained on far more TypeScript and Python than the other languages, so even the non-thinking model can catch bugs by simply pattern matching. For less widely used languages like Ruby and Rust, the thinking step seems to help by letting the model reason through possible issues logically.

Interesting Bugs

  1. Gain Calculation in Audio Processing Library (Ruby)

    The bug was in the `TimeStretchProcessor` class of a Ruby audio processing library, specifically in how it calculates `normalize_gain`. Instead of adjusting the gain based on the `stretch_factor` (which represents how much the audio is being sped up or slowed down), it used a fixed formula. This means the output audio ends up with the wrong amplitude: either too loud or too quiet, depending on the stretch. The correct approach is to scale the gain relative to the stretch so that audio levels stay consistent.

    Sonnet 3.7 failed to detect this bug, but 3.7 Thinking caught it (a sketch of the pattern follows this list).

  2. Race Condition in Smart Home Notification System (Go)

    The most critical bug in this code was the absence of locking around device updates before broadcasting, which could lead to race conditions where clients receive stale or partially updated device state. It occurred in the `NotifyDeviceUpdate` method of `ApiServer`, which broadcasts device updates without synchronizing with concurrent device state modifications.

    In this case too, only the thinking model caught the issue (a sketch of the locking pattern also follows below).
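For the first bug, here is a hypothetical Ruby sketch of the pattern described above. The class and method names (`TimeStretchProcessor`, `normalize_gain`, `stretch_factor`) come from the write-up, but the surrounding code and the corrected formula are assumptions; the point is simply that the gain has to depend on the stretch factor instead of being a constant.

```ruby
# Hypothetical sketch of the gain bug; the real benchmark code is not
# public, and the exact formulas here are illustrative.
class TimeStretchProcessor
  attr_reader :stretch_factor

  def initialize(stretch_factor)
    @stretch_factor = stretch_factor # how much the audio is sped up or slowed down
  end

  # Buggy version: gain is a fixed constant, so output amplitude is wrong
  # (too loud or too quiet) whenever the stretch differs from whatever the
  # constant implicitly assumed.
  def normalize_gain_buggy
    0.5
  end

  # Plausible fix (assumed): scale the gain with the stretch factor so the
  # stretched output keeps a consistent amplitude.
  def normalize_gain
    1.0 / stretch_factor
  end

  def process(samples)
    samples.map { |sample| sample * normalize_gain }
  end
end
```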
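The second bug was in Go, but the missing-synchronization pattern is easy to show in the same language as the other sketches here. The class and helper names below are invented; the idea is that the state mutation and the broadcast need to happen under one lock so clients never observe a half-updated device.

```ruby
require "json"

# Hypothetical Ruby analog of the missing-lock bug (the original was a Go
# ApiServer whose NotifyDeviceUpdate broadcast without synchronization).
class DeviceHub
  def initialize
    @devices = {}          # device_id => attribute hash
    @clients = []          # anything that responds to <<, e.g. socket wrappers
    @mutex   = Mutex.new
  end

  # Fixed version: mutate and broadcast while holding the lock. The buggy
  # version did both without @mutex, so a broadcast racing with an update
  # could serialize a stale or partially updated device state.
  def update_device(id, attrs)
    @mutex.synchronize do
      device = (@devices[id] ||= {})
      attrs.each { |key, value| device[key] = value }  # multi-step mutation
      broadcast(id, device)
    end
  end

  private

  def broadcast(id, device)
    payload = { id: id, state: device }.to_json
    @clients.each { |client| client << payload }
  end
end
```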

It seems that Sonnet 3.7 Thinking is a better model for catching bugs in code than the base 3.7 model; there were no cases where Thinking missed a bug that base 3.7 caught. Oddly, I still gravitate toward 3.7 in Cursor for code generation, likely a mix of it being faster and reasoning being less useful for the kind of codegen I do there. When it comes to detecting bugs, though, reasoning wins.

