Bug detection is a very different task from code generation, and a much harder one. It requires understanding, reasoning, and an ability to infer what the code is trying to do. The best models don't just match patterns; they think.
In this post, we compare two capable models: OpenAI's o3-mini and Anthropic's Sonnet 3.5. Both were tested across five programming languages on the same dataset of challenging, real-world bugs. Let's see how they performed.
🧪 Evaluation Setup
We used a benchmark of 210 hand-crafted programs, each containing one hard-to-catch bug. The bugs were realistic: logic flaws, race conditions, edge cases, the kind of things that slip through review and cause production issues.
We wanted the dataset to cover multiple domains and languages, so I picked sixteen domains, chose two to three self-contained programs for each, and used Cursor to generate each program in TypeScript, Ruby, Python, Go, and Rust.
Next, I cycled through the programs and introduced a single subtle bug in each one. Each bug had to be:
- A bug that a professional developer could reasonably introduce
- A bug that could easily slip through linters, tests, and manual code review
Some examples of bugs I introduced:
- Undefined `response` variable in an `ensure` block
- Not accounting for amplitude normalization when computing wave stretching on a sound sample
- A hard-coded date that would be accurate in most, but not all, situations (sketched below)
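To make that last one concrete, here's a minimal sketch of the hard-coded date pattern. This is illustrative Go with hypothetical names, not one of the actual benchmark programs:

```go
package main

import (
	"fmt"
	"time"
)

// daysUntilRenewal reports how many days remain before the annual
// renewal deadline. The bug: the year is hard-coded, so the result
// is correct for now but silently wrong after 2024. Tests written
// this year all pass, and nothing looks off in review.
func daysUntilRenewal(now time.Time) int {
	deadline := time.Date(2024, time.December, 31, 0, 0, 0, 0, time.UTC)
	return int(deadline.Sub(now).Hours() / 24)
}

func main() {
	fmt.Println(daysUntilRenewal(time.Now()))
}
```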
At the end of this, I had 210 programs, each with a small, difficult-to-catch, and realistic bug.
A disclaimer: these are the hardest-to-catch bugs I could think of, not representative of the median bug found in everyday software.
📊 Results
Total Bugs Caught (out of 210)
- o3-mini: 37 bugs
- Sonnet 3.5: 26 bugs
By Language
- Python: o3-mini found 7, Sonnet found 3
- TypeScript: o3-mini found 7, Sonnet found 5
- Go: o3-mini found 7, Sonnet found 8 (Sonnet's only outright win)
- Rust: o3-mini found 9, Sonnet found 3
- Ruby: both models tied at 7 bugs each
o3-mini, the reasoning-first model of the pair, performed better overall, especially in Python and Rust, where heavy training data likely also gave it plenty of familiar bug patterns to draw on.
🧠 Observations
Why does o3-mini win overall?
o3-mini appears to benefit from strong language coverage and a hybrid of reasoning + memorization. It shines in Python and Rust, likely due to rich training data and effective internal planning steps.
Where does Sonnet do well?
In Go and Ruby, Sonnet 3.5 shows that a strong general-purpose model can still generalize, especially where training data is thinner and deductive logic is needed. It also successfully flagged race conditions that o3-mini missed.
These results reinforce a growing pattern: deliberate reasoning wins when a bug requires multi-step logic, while pattern matching excels when there's lots of prior signal; no single model dominates both everywhere.
🌍 Reasoning Matters More in Certain Languages
In languages like Go and Ruby, which are less represented in LLM training data, pattern recognition alone isn't enough; the model has to reason about what the code is meant to do. That's where Sonnet 3.5 starts to shine.
In languages like Python and TypeScript, o3-mini can lean on pre-learned bug patterns, which might explain its consistent edge.
No model wins universally; performance depends heavily on language, bug type, and whether reasoning or memory is more useful for the task.
🧩 A Bug Worth Highlighting
Test 23 (Go): Race Condition in NotifyDeviceUpdate
This bug lived in a smart home system's ApiServer. It updated shared state and broadcast it to clients without any synchronization: a classic race condition.
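In rough shape, the bug looked something like this (a minimal Go sketch with hypothetical types and field names, reconstructed from the description above, not the actual benchmark code). Running it with `go run -race` flags the concurrent map access immediately:

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// Device is a hypothetical smart-home device record.
type Device struct {
	ID    string `json:"id"`
	State string `json:"state"`
}

// ApiServer holds shared device state. Note what's missing:
// no mutex guards the devices map.
type ApiServer struct {
	devices     map[string]*Device
	subscribers []chan []byte
}

// NotifyDeviceUpdate mutates shared state and broadcasts it with no
// synchronization, so concurrent calls interleave and clients can
// receive stale or partially updated device state.
func (s *ApiServer) NotifyDeviceUpdate(d *Device) {
	s.devices[d.ID] = d // unsynchronized write to a shared map
	payload, _ := json.Marshal(s.devices)
	for _, sub := range s.subscribers {
		sub <- payload
	}
}

func main() {
	s := &ApiServer{devices: make(map[string]*Device)}
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			s.NotifyDeviceUpdate(&Device{ID: "thermostat", State: fmt.Sprint(i)})
		}(i)
	}
	wg.Wait()
}
```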
Sonnet 3.5 caught it and explained:
"The most critical bug is that there was no locking around device updates before broadcasting, which could lead to race conditions where clients receive stale or partially updated device state..."
o3-mini missed it entirely.
This is a great example of where careful reasoning wins. There's no surface-level error, just an implicit contract being broken, and Sonnet 3.5 inferred the risk from the code's logic and flow rather than its syntax.
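For completeness, the conventional fix is to guard the shared state with a mutex and snapshot everything the broadcast needs before releasing it. Again, a sketch under the same hypothetical types, not the benchmark's actual code:

```go
package main

import (
	"encoding/json"
	"sync"
)

type Device struct {
	ID    string `json:"id"`
	State string `json:"state"`
}

// ApiServer now owns a mutex, and every access to shared state
// happens while holding it.
type ApiServer struct {
	mu          sync.Mutex
	devices     map[string]*Device
	subscribers []chan []byte
}

// NotifyDeviceUpdate updates the map and snapshots both the payload
// and the subscriber list under the lock, so every client receives a
// consistent view. The broadcast itself happens outside the lock so
// a slow client can't block other writers.
func (s *ApiServer) NotifyDeviceUpdate(d *Device) {
	s.mu.Lock()
	s.devices[d.ID] = d
	payload, _ := json.Marshal(s.devices)
	subs := append([]chan []byte(nil), s.subscribers...)
	s.mu.Unlock()
	for _, sub := range subs {
		sub <- payload
	}
}

func main() {
	s := &ApiServer{devices: make(map[string]*Device)}
	s.NotifyDeviceUpdate(&Device{ID: "thermostat", State: "heating"})
}
```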
✅ Final Thoughts
OpenAI's o3-mini wins this round overall, but the battle is closer than it looks.
- o3-mini dominates in languages with strong training coverage and common bug patterns.
- Sonnet 3.5 stands out in concurrency issues and less conventional scenarios where reasoning is critical.
If you're choosing an AI model for bug detection, think about your stack and the types of bugs you care about. The best choice might vary depending on whether you're dealing with frontend JS or distributed Go microservices.
Greptile uses models like o3-mini in production to surface real bugs in PRs: logic errors, concurrency issues, you name it. Want to see what it finds in your code? Try Greptile, no credit card required.