Bug Detection Showdown: OpenAI o3-mini vs Claude 3.5 Sonnet

April 3, 2025

Written by Everett Butler

Bug detection is a very different task from code generation, and a much harder one. It requires understanding, reasoning, and an ability to infer what the code is trying to do. The best models don't just match patterns; they think.

In this post, we compare two capable models: OpenAI's o3-mini and Anthropic's Sonnet 3.5. Both were tested across five programming languages on the same dataset of challenging, real-world bugs. Let's see how they performed.

🧪 Evaluation Setup

We used a benchmark of 210 hand-crafted programs, each containing one hard-to-catch bug. The bugs were realistic: logic flaws, race conditions, edge cases, the kind of things that slip through review and cause production issues.

We wanted the dataset to cover multiple domains and languages, so I picked sixteen domains, chose 2-3 self-contained programs for each, and used Cursor to generate each program in TypeScript, Ruby, Python, Go, and Rust.

Here are the programs we created for the evaluation:

ID  Program
1   distributed microservices platform
2   event-driven simulation engine
3   containerized development environment manager
4   natural language processing toolkit
5   predictive anomaly detection system
6   decentralized voting platform
7   smart contract development framework
8   custom peer-to-peer network protocol
9   real-time collaboration platform
10  progressive web app framework
11  webassembly compiler and runtime
12  serverless orchestration platform
13  procedural world generation engine
14  ai-powered game testing framework
15  multiplayer game networking engine
16  big data processing framework
17  real-time data visualization platform
18  machine learning model monitoring system
19  advanced encryption toolkit
20  penetration testing automation framework
21  iot device management platform
22  edge computing framework
23  smart home automation system
24  quantum computing simulation environment
25  bioinformatics analysis toolkit
26  climate modeling and simulation platform
27  advanced code generation ai
28  automated code refactoring tool
29  comprehensive developer productivity suite
30  algorithmic trading platform
31  blockchain-based supply chain tracker
32  personal finance management ai
33  advanced audio processing library
34  immersive virtual reality development framework
35  serverless computing optimizer
36  distributed machine learning training framework
37  robotic process automation rpa platform
38  adaptive learning management system
39  interactive coding education platform
40  language learning ai tutor
41  comprehensive personal assistant framework
42  multiplayer collaboration platform

Next, I went through each program and introduced a tiny bug. Each bug had to be:

  1. A bug that a professional developer could reasonably introduce
  2. A bug that could easily slip through linters, tests, and manual code review

Some examples of bugs I introduced:

  1. An undefined `response` variable referenced in an ensure block
  2. Failing to account for amplitude normalization when computing wave stretching on a sound sample
  3. A hard-coded date that is accurate in most, but not all, situations (sketched below)
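
To show how innocuous these look, here is a minimal Go sketch of the hard-coded-date flavor of bug. The function and scenario are hypothetical illustrations, not code from the actual benchmark programs:

```go
package main

import (
	"fmt"
	"time"
)

// daysInFebruary returns the number of days in February for a given year.
// BUG: the hard-coded 28 is right for most years but silently wrong in
// leap years (2024, 2028, ...), when February has 29 days.
func daysInFebruary(year int) int {
	return 28 // plausible constant, passes most tests, fails every leap year
}

// daysInFebruaryFixed derives the answer from the standard library:
// time.Date normalizes "March 0" to the last day of February.
func daysInFebruaryFixed(year int) int {
	return time.Date(year, time.March, 0, 0, 0, 0, 0, time.UTC).Day()
}

func main() {
	for _, year := range []int{2023, 2024} {
		fmt.Printf("%d: hard-coded=%d actual=%d\n",
			year, daysInFebruary(year), daysInFebruaryFixed(year))
	}
}
```

Nothing here trips a linter, and a test suite written in a non-leap year passes cleanly; catching it means asking whether the constant should be a constant at all.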

At the end of this, I had 210 programs, each with a small, difficult-to-catch, and realistic bug.

A disclaimer: these bugs are the hardest-to-catch bugs I could think of, and are not representative of the median bugs usually found in everyday software.
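
Scoring a run like this is conceptually a simple loop: each model reviews every program, and it gets credit only when its review identifies the introduced bug. Here is a rough Go sketch of that loop; `loadPrograms` and `reviewProgram` are hypothetical stand-ins, since the actual harness and model API calls aren't reproduced here:

```go
package main

import "fmt"

// Program is one benchmark case: generated source code plus a description
// of the single bug deliberately introduced into it.
type Program struct {
	ID       int
	Language string
	Source   string
	BugDesc  string
}

// loadPrograms is a hypothetical loader for the 42 programs x 5 languages
// = 210 cases; the real dataset lives outside this sketch.
func loadPrograms() []Program { return nil }

// reviewProgram is a hypothetical stand-in for sending p.Source to a model
// (o3-mini, Sonnet 3.5, ...) and judging whether its review describes the
// bug in p.BugDesc. That judgment is the subjective part of any such eval.
func reviewProgram(model string, p Program) bool {
	return false // placeholder
}

// score counts catches per language for one model.
func score(model string, programs []Program) map[string]int {
	caught := map[string]int{}
	for _, p := range programs {
		if reviewProgram(model, p) {
			caught[p.Language]++
		}
	}
	return caught
}

func main() {
	programs := loadPrograms()
	for _, model := range []string{"o3-mini", "sonnet-3.5"} {
		fmt.Println(model, score(model, programs))
	}
}
```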

📊 Results

Total Bugs Caught

  • o3-mini: 37 of 210 bugs
  • Sonnet 3.5: 26 of 210 bugs

By Language

  • Python: o3-mini found 7, Sonnet found 3
  • TypeScript: o3-mini found 7, Sonnet found 5
  • Go: o3-mini found 7, Sonnet edged ahead with 8
  • Rust: o3-mini found 9, Sonnet found 3
  • Ruby: both models tied at 7 bugs each

o3-mini, the reasoning-focused model of the pair, performed better overall, especially in Python and Rust, where pattern recognition from heavy training data may have compounded its reasoning advantage.

🧠 Observations

Why does o3-mini win overall?
o3-mini appears to benefit from strong language coverage and a hybrid of reasoning + memorization. It shines in Python and Rust, likely due to rich training data and effective internal planning steps.

Where does Sonnet do well?
In Go and Ruby, Sonnet 3.5 held its own, tying o3-mini in Ruby and edging it out in Go, despite not being a dedicated reasoning model. It also successfully flagged race conditions that o3-mini missed.

These results complicate a tidy narrative: the reasoning-focused o3-mini did best where prior training signal is richest, while Sonnet 3.5 earned its wins on bugs that demanded inferring intent, such as race conditions.

๐Ÿ” Reasoning Matters More in Certain Languages

In languages like Go and Ruby, which are less represented in LLM training data, pattern recognition alone isn't enough; a model has to reason about what the code is trying to do. That's where Sonnet 3.5 starts to shine.

In languages like Python and TypeScript, o3-mini can lean on pre-learned bug patterns, which might explain its consistent edge.

No model wins universally; performance depends heavily on language, bug type, and whether reasoning or memory is more useful for the task.

🧩 A Bug Worth Highlighting

Test 23 (Go): Race Condition in NotifyDeviceUpdate

This bug lived in a smart home system's ApiServer. It updated shared device state and broadcast it to clients without any synchronization: a classic race condition.

  • Sonnet 3.5 caught it and explained:

    "The most critical bug is that there was no locking around device updates before broadcasting, which could lead to race conditions where clients receive stale or partially updated device state..."

  • o3-mini missed it entirely.

This is a great example of a bug you can only catch by reasoning about intent. There's no surface-level mistake, just an implicit contract being broken. Sonnet 3.5 inferred the risk from logic and flow, not just syntax.
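
To make the pattern concrete, here is a minimal Go sketch of this class of bug, with hypothetical type and field names rather than the benchmark's actual code:

```go
package main

import "sync"

// Device is a hypothetical smart home device's state.
type Device struct {
	ID     string
	Online bool
	Temp   float64
}

// ApiServer holds shared device state and a list of subscriber channels.
// Names are illustrative, not the benchmark's actual code.
type ApiServer struct {
	mu          sync.RWMutex // never taken in the buggy path below
	devices     map[string]*Device
	subscribers []chan Device
}

// NotifyDeviceUpdate (buggy): mutates shared state and broadcasts it
// without taking s.mu. A concurrent reader or a second update can observe
// or send a partially updated Device: a data race.
func (s *ApiServer) NotifyDeviceUpdate(id string, online bool, temp float64) {
	d := s.devices[id]
	d.Online = online
	d.Temp = temp // another goroutine may read d between these two writes
	for _, sub := range s.subscribers {
		sub <- *d // reads *d unsynchronized, too
	}
}

// NotifyDeviceUpdateFixed: the same logic, but the state is mutated and
// snapshotted under the lock, so subscribers always receive a consistent
// Device value. (Subscriber registration is assumed to finish before
// updates start, to keep the sketch short.)
func (s *ApiServer) NotifyDeviceUpdateFixed(id string, online bool, temp float64) {
	s.mu.Lock()
	d := s.devices[id]
	d.Online = online
	d.Temp = temp
	snapshot := *d
	s.mu.Unlock()
	for _, sub := range s.subscribers {
		sub <- snapshot
	}
}

func main() {}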

✅ Final Thoughts

OpenAI's o3-mini wins this round overall, but the battle is closer than it looks.

  • o3-mini dominates in languages with strong training coverage and common bug patterns.
  • Sonnet 3.5 stands out in concurrency issues and less conventional scenarios where reasoning is critical.

If you're choosing an AI model for bug detection, think about your stack and the types of bugs you care about. The best choice might vary depending on whether you're dealing with frontend JS or distributed Go microservices.


Greptile uses models like o3-mini in production to surface real bugs in PRs: logic errors, concurrency issues, you name it. Want to see what it finds in your code? Try Greptile, no credit card required.

