OpenAI has been shipping increasingly capable small models lately — and two of its most interesting contenders for AI code review are o3-mini and 4o-mini. Both are designed for lightweight reasoning tasks, but which one is better at finding real bugs in software?
We tested them head-to-head across five programming languages to find out.
🧪 How We Tested
Our benchmark dataset consists of 210 programs, each seeded with a realistic bug. These aren’t easy-to-spot errors — they’re subtle logic flaws, concurrency issues, or semantic edge cases that might slip past linters, tests, and even human reviewers.
The languages we evaluated: Python, Go, Rust, TypeScript, and Ruby. Both models were given the same prompt and context and asked to flag any bugs.
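For concreteness, here is a minimal sketch of the kind of harness this implies. It assumes the OpenAI Python SDK; the prompt wording, the review() helper, and the model identifiers as written are illustrative assumptions, not the exact benchmark setup.

```python
# Minimal sketch of a head-to-head review harness (illustrative, not the
# exact benchmark code). Assumes the OpenAI Python SDK and an API key in
# the environment.
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "You are reviewing a pull request. Carefully read the following program "
    "and report any bugs you find, or state that you found none.\n\n{code}"
)

def review(model: str, source: str) -> str:
    """Send one program to one model and return its review."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(code=source)}],
    )
    return response.choices[0].message.content

# Both models get the identical prompt and context for each program:
# review("o3-mini", source) vs. review("gpt-4o-mini", source)
```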
Next, we cycled through and introduced a tiny bug in each one. Each bug had to be:
- A bug that a professional developer could reasonably introduce
- A bug that could easily slip through linters, tests, and manual code review
Some examples of bugs we introduced:
- Undefined `response` variable in a Ruby `ensure` block
- Not accounting for amplitude normalization when computing wave stretching on a sound sample
- Hard-coded date that would be accurate in most, but not all, situations (sketched below)
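To make that last bug class concrete, here is a hypothetical Python sketch of the hard-coded-date pattern. The function and dates are invented for illustration; none of this code comes from the actual benchmark programs.

```python
# Hypothetical illustration of the hard-coded-date bug class (not one of
# the actual benchmark programs).
from datetime import date

def days_until_year_end(today: date) -> int:
    # BUG: the year is pinned, so this is correct for any date in 2024
    # and silently wrong for every other year.
    year_end = date(2024, 12, 31)
    return (year_end - today).days

def days_until_year_end_fixed(today: date) -> int:
    # Fix: derive the year from the input instead of hard-coding it.
    return (date(today.year, 12, 31) - today).days
```

A test suite written in 2024 would pass both versions, which is exactly why this kind of bug slips through.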
At the end of this, we had 210 programs, each with a small, difficult-to-catch, and realistic bug.
A disclaimer: these bugs are the hardest-to-catch bugs we could think of, and are not representative of the median bugs found in everyday software.
📊 Results
Overall Performance
- o3-mini: Caught 37 of the 210 bugs (~17.6%)
- 4o-mini: Caught 19 of the 210 bugs (~9.0%)
That's nearly a 2× detection rate in favor of o3-mini.
By Language
- Python: o3-mini found 7, 4o-mini found 4
- TypeScript: o3-mini found 7, 4o-mini found 2
- Go: o3-mini found 7, 4o-mini found 3
- Rust: o3-mini found 9, 4o-mini found 4
- Ruby: o3-mini found 7, 4o-mini found 6
The gap holds across all five languages, though it narrows sharply in Ruby, where 4o-mini (6) nearly matched o3-mini (7).
🧠 Why the Gap?
There are a few plausible reasons o3-mini consistently outperformed 4o-mini:
- Planning and Reasoning: o3-mini appears to be part of OpenAI’s reasoning-first model family, where generation is preceded by internal planning. That extra step often helps catch bugs that aren’t obvious pattern matches.
- Model Architecture: It’s possible that o3-mini benefits from architectural tweaks or system-level optimizations that give it better performance in logic-heavy tasks like bug detection.
- Training Differences: Both models were likely trained on overlapping data, but o3-mini might have had more targeted training around software-related reasoning or verification patterns.
In contrast, 4o-mini, while fast and compact, may sacrifice some of that depth for speed or generality.
🧩 Bug Highlight: Python Logic Collapse
Here’s one bug that illustrates the difference in depth between the models.
Test 38: Search Endpoint Filter Collapse
In a Python script handling task search filters, there were two $or keys in the query dictionary. The second one overwrote the first, breaking the compound filtering logic.
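Here is a hypothetical reconstruction of that bug class: a MongoDB-style query dict with a duplicate key. The field names are invented; this is not the actual Test 38 program.

```python
# Hypothetical reconstruction of the duplicate-$or bug (field names invented;
# not the actual Test 38 code). In a Python dict literal, duplicate keys are
# legal syntax: the later value silently replaces the earlier one.
def build_search_query(user_id: str, term: str) -> dict:
    return {
        "archived": False,
        "$or": [                           # user-based filter...
            {"owner_id": user_id},
            {"assignee_id": user_id},
        ],
        "$or": [                           # BUG: ...silently overwritten here,
            {"title": {"$regex": term}},   # so only the text filter survives
            {"notes": {"$regex": term}},
        ],
    }

def build_search_query_fixed(user_id: str, term: str) -> dict:
    # Fix: combine both conditions explicitly under "$and".
    return {
        "archived": False,
        "$and": [
            {"$or": [{"owner_id": user_id}, {"assignee_id": user_id}]},
            {"$or": [{"title": {"$regex": term}},
                     {"notes": {"$regex": term}}]},
        ],
    }
```

Both functions run without error; the flaw in the first is purely semantic, which is why it reads as plausible code in review.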
- o3-mini's Output: "The function relies on having multiple $or conditions in the query dict, but the second overwrites the first. This results in user-based filtering being completely ignored, breaking the intended search behavior."
- 4o-mini's Output: "The $or logic may not work as intended because the dictionary has duplicate keys."
Both models identified the general issue — but o3-mini offered a much clearer understanding of the logical consequence. It didn’t just spot the key conflict; it explained why it matters.
✅ Final Thoughts
In this benchmark, o3-mini was the clear winner. It found more bugs, gave better explanations, and performed more consistently across all five languages.
4o-mini still shows potential, especially on surface-level issues and in languages with high training coverage. But if your use case demands catching hard bugs, particularly the kind that sneak past pattern matching, o3-mini is the better choice today.
Greptile runs models like o3-mini in production, automatically reviewing pull requests and catching bugs before they hit prod. Want to see what it finds in your code? Try Greptile — no credit card required.