OpenAI o3-mini vs Anthropic Sonnet 3.7 Thinking for Bug Detection

April 3, 2025

Written by Everett Butler

Bug detection is one of the hardest tasks in software engineering — and it’s just as challenging for AI models. Unlike code generation, which often relies on pattern completion, bug detection requires inference, planning, and the ability to reason through edge cases and hidden logic.

In this post, we compare OpenAI’s o3-mini and Anthropic’s Sonnet 3.7 Thinking, two compact models that bring very different architectures to the table. The former is a reasoning-first OpenAI model. The latter, an explicitly “thinking” variant from Anthropic, adds a planning step before response generation. Which one catches more bugs?

🧪 Evaluation Setup

We tested both models on a benchmark of 210 small programs: 42 programs, each implemented in five languages, and each seeded with one subtle bug. The bugs were typically logic errors, concurrency edge cases, or intentionally tricky misuses of a common API.

Languages tested: Python, TypeScript, Go, Rust, and Ruby. Each model got the same prompt, context, and structure.
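
The post doesn't include the exact harness, but conceptually the comparison looks like the sketch below, written against both vendors' official Python SDKs. The prompt wording, token budgets, and model ID strings here are illustrative assumptions, not the exact values used in the evaluation:

    from openai import OpenAI
    import anthropic

    # Illustrative prompt; the real evaluation prompt may have differed.
    PROMPT = (
        "Review the following program and identify the single most "
        "critical bug, if any. Explain it in one sentence.\n\n{code}"
    )

    openai_client = OpenAI()
    anthropic_client = anthropic.Anthropic()

    def review_with_o3_mini(code: str) -> str:
        resp = openai_client.chat.completions.create(
            model="o3-mini",
            messages=[{"role": "user", "content": PROMPT.format(code=code)}],
        )
        return resp.choices[0].message.content

    def review_with_sonnet_thinking(code: str) -> str:
        resp = anthropic_client.messages.create(
            model="claude-3-7-sonnet-20250219",  # Sonnet 3.7 with extended thinking
            max_tokens=4096,
            thinking={"type": "enabled", "budget_tokens": 2048},
            messages=[{"role": "user", "content": PROMPT.format(code=code)}],
        )
        # Extended-thinking responses interleave "thinking" and "text"
        # blocks; keep only the final answer text.
        return "".join(b.text for b in resp.content if b.type == "text")

Each program's source was substituted into the same prompt template, and each model's answer was then judged against the known injected bug.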

Here are the programs we created for the evaluation:

  1. distributed microservices platform
  2. event-driven simulation engine
  3. containerized development environment manager
  4. natural language processing toolkit
  5. predictive anomaly detection system
  6. decentralized voting platform
  7. smart contract development framework
  8. custom peer-to-peer network protocol
  9. real-time collaboration platform
  10. progressive web app framework
  11. webassembly compiler and runtime
  12. serverless orchestration platform
  13. procedural world generation engine
  14. ai-powered game testing framework
  15. multiplayer game networking engine
  16. big data processing framework
  17. real-time data visualization platform
  18. machine learning model monitoring system
  19. advanced encryption toolkit
  20. penetration testing automation framework
  21. iot device management platform
  22. edge computing framework
  23. smart home automation system
  24. quantum computing simulation environment
  25. bioinformatics analysis toolkit
  26. climate modeling and simulation platform
  27. advanced code generation ai
  28. automated code refactoring tool
  29. comprehensive developer productivity suite
  30. algorithmic trading platform
  31. blockchain-based supply chain tracker
  32. personal finance management ai
  33. advanced audio processing library
  34. immersive virtual reality development framework
  35. serverless computing optimizer
  36. distributed machine learning training framework
  37. robotic process automation (rpa) platform
  38. adaptive learning management system
  39. interactive coding education platform
  40. language learning ai tutor
  41. comprehensive personal assistant framework
  42. multiplayer collaboration platform
Next, I cycled through each program and introduced a single small bug. Each bug had to be:

  1. A bug that a professional developer could reasonably introduce
  2. A bug that could easily slip past linters, tests, and manual code review

Some examples of bugs I introduced (one is sketched below):

  1. An undefined response variable referenced in an ensure block
  2. Not accounting for amplitude normalization when computing wave stretching on a sound sample
  3. A hard-coded date calculation that would be accurate in most, but not all, situations
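
To make the third category concrete, here is a small hypothetical Python illustration in the same spirit (not one of the actual benchmark programs). The hard-coded 365-day year is correct for most inputs, so any test that doesn't straddle a leap day will pass:

    from datetime import date, timedelta

    def license_expiry(issued: date) -> date:
        # Subtle bug: hard-codes one year as 365 days. Correct for most
        # issue dates, but one day short whenever the span crosses Feb 29.
        return issued + timedelta(days=365)

    def license_expiry_fixed(issued: date) -> date:
        # Advance the calendar year instead, handling a Feb 29 issue date.
        try:
            return issued.replace(year=issued.year + 1)
        except ValueError:
            return issued.replace(year=issued.year + 1, month=3, day=1)

    print(license_expiry(date(2023, 3, 1)))        # 2024-02-29, one day early
    print(license_expiry_fixed(date(2023, 3, 1)))  # 2024-03-01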

At the end of this, I had 210 programs, each with a small, difficult-to-catch, and realistic bug.

A disclaimer: these bugs are the hardest-to-catch bugs I could think of, and are not representative of the median bugs usually found in everyday software.

📊 Results

Overall Bugs Caught

  • OpenAI o3-mini: 37 of 210 (17.6%)
  • Anthropic Sonnet 3.7 Thinking: 21 of 210 (10.0%)

By Language (out of 42 per language)

  • Python: o3-mini 7, Sonnet 3.7 Thinking 2
  • TypeScript: o3-mini 7, Sonnet 3.7 Thinking 5
  • Rust: o3-mini 9, Sonnet 3.7 Thinking 5
  • Ruby: o3-mini 7, Sonnet 3.7 Thinking 5
  • Go: o3-mini 7, Sonnet 3.7 Thinking 4

OpenAI’s o3-mini outperformed Sonnet 3.7 Thinking across the board — especially in Python and Rust. Sonnet was more competitive in Ruby and TypeScript, and showed reasonable generalization in Go.

🧠 Thoughts

Despite being a “thinking” model, Anthropic’s Sonnet 3.7 Thinking did not outperform OpenAI’s o3-mini in overall bug detection.

o3-mini demonstrated stronger consistency across languages — and particularly dominated in Python and Rust, likely due to a combination of:

  • Pattern recognition, thanks to extensive training on code
  • Internal planning, which o3-mini appears to do even without an explicit “thinking step”

Sonnet 3.7 Thinking did show sparks of reasoning strength in some lower-resource languages like Ruby and Go, where logic deduction plays a bigger role and memorization falls short. But it didn’t close the gap overall.

It’s possible that Sonnet’s thinking step helps in theory, but doesn’t yet translate into better bug detection performance — at least on this dataset.

🔍 Where Reasoning Helped

One case showed Sonnet 3.7 Thinking’s strengths:

Test Case 2 — Python: Asynchronous Path Bug

This involved a path reconstruction function that relied on a shared dictionary (came_from) across asynchronous contexts. A race condition could corrupt the logic mid-execution.

  • Sonnet 3.7 Thinking’s Output:

    "The most critical bug is that the function does not protect against asynchronous modifications to the came_from dictionary during path reconstruction, potentially leading to inconsistent paths or infinite loops."

  • o3-mini missed the issue.

This is where reasoning helps. Sonnet correctly inferred future states, race conditions, and the consequences of mutation — without obvious syntactic cues. It’s a good sign for future “thinking” model generations.
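
To see why this takes genuine reasoning, here is a minimal, hypothetical reconstruction of the bug class (not the actual benchmark program). One task walks the shared came_from map while another rewrites it, and the reconstructed path matches neither the old route nor the new one:

    import asyncio

    # came_from maps each node to its predecessor on the best-known path.
    came_from: dict[str, str] = {"C": "B", "B": "A"}  # old route: A -> B -> C

    async def reconstruct_path(goal: str) -> list[str]:
        path = [goal]
        node = goal
        while node in came_from:
            node = came_from[node]
            path.append(node)
            await asyncio.sleep(0)  # any await point lets other tasks run
        return list(reversed(path))

    async def reroute() -> None:
        # A concurrent search update replaces the route mid-reconstruction.
        came_from.clear()
        came_from.update({"C": "D", "D": "E", "B": "F"})  # new route: E -> D -> C

    async def main() -> None:
        task = asyncio.create_task(reconstruct_path("C"))
        await asyncio.sleep(0)  # let the reconstruction take its first step
        await reroute()
        print(await task)  # ['F', 'B', 'C']: matches neither old nor new route

    asyncio.run(main())

If the concurrent update happened to introduce a cycle instead (say, B -> C while C -> B is still being read), the while loop would never terminate, which is exactly the infinite-loop failure Sonnet called out. Nothing in the buggy version is syntactically wrong; the defect only shows up when you simulate the interleaving.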

✅ Final Thoughts

OpenAI o3-mini outperformed Sonnet 3.7 Thinking in this benchmark — both in overall accuracy and consistency across languages.

But that doesn’t mean reasoning models aren’t useful. When bugs require following multiple steps of logic or simulating asynchronous behavior, Sonnet 3.7 Thinking has real value — just not enough (yet) to win outright.

