🧠 Introduction
AI-assisted code generation has made huge strides—but what about AI-powered code review?
In this post, we compare OpenAI’s o1-mini and 4o-mini, two compact LLMs, on their ability to detect real bugs in real code. Unlike generation, bug detection requires understanding logic, context, and intent. It’s a different kind of challenge—one that tests a model’s reasoning capabilities.
🧪 The Evaluation Dataset
I wanted the dataset of bugs to cover multiple domains and languages. I picked sixteen domains, chose 2-3 self-contained programs for each, and used Cursor to generate each program in TypeScript, Ruby, Python, Go, and Rust.
Here are the programs we created for the evaluation:
Next, I cycled through the programs and introduced a tiny bug into each one. Each bug had to be:
- A bug that a professional developer could reasonably introduce
- A bug that could easily slip through linters, tests, and manual code review
Some examples of bugs I introduced:
- An undefined response variable referenced in an ensure block (see the sketch after this list)
- Not accounting for amplitude normalization when computing wave stretching on a sound sample
- A hard-coded date that would be accurate in most, but not all, situations
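To make the first of these concrete, here is a hedged Ruby sketch of the ensure-block pattern. The class, endpoint, and variable names are invented for illustration and are not the actual benchmark program; the point is that the bad reference only executes on a failure path, which is exactly why linters and happy-path tests tend to miss it.

```ruby
require "json"
require "net/http"

# Hypothetical client, not the benchmark program itself.
class ProfileClient
  def initialize(host)
    @host = host
  end

  def fetch_profile(user_id)
    resp = Net::HTTP.get_response(URI("https://#{@host}/users/#{user_id}"))
    @profile = JSON.parse(resp.body)
  ensure
    unless @profile
      # This branch runs only when the request or the parse failed, so the
      # happy path (and most tests) never executes it.
      # BUG: the local variable is `resp`; `response` is undefined here, so
      # this line raises NameError at runtime and masks the original error.
      warn "profile fetch for user #{user_id} failed (status=#{response&.code})"
    end
  end
end
```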
At the end of this, I had 210 programs, each with a small, difficult-to-catch, and realistic bug.
A disclaimer: these bugs are the hardest-to-catch bugs I could think of, and are not representative of the median bugs usually found in everyday software.
📊 Results
We ran both models on a benchmark of 210 buggy programs across five languages: Go, Python, TypeScript, Rust, and Ruby.
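For context, here is a minimal sketch of what one review pass over such a benchmark could look like. The prompt wording, directory layout, and grading step are assumptions made for illustration (the actual harness isn't shown in this post); only the chat-completions endpoint and the model identifiers o1-mini and gpt-4o-mini come from OpenAI's public API.

```ruby
require "json"
require "net/http"
require "uri"

OPENAI_URI = URI("https://api.openai.com/v1/chat/completions")

# Ask one model to review one program and return its raw answer.
def review(model, source_code)
  request = Net::HTTP::Post.new(OPENAI_URI)
  request["Authorization"] = "Bearer #{ENV.fetch('OPENAI_API_KEY')}"
  request["Content-Type"]  = "application/json"
  request.body = JSON.generate(
    model: model,
    messages: [
      { role: "user",
        content: "Review the following program and report any bugs you find:\n\n#{source_code}" }
    ]
  )
  response = Net::HTTP.start(OPENAI_URI.host, OPENAI_URI.port, use_ssl: true) do |http|
    http.request(request)
  end
  JSON.parse(response.body).dig("choices", 0, "message", "content")
end

# Run both models over every benchmark program; whether each planted bug was
# actually called out is judged afterwards, per program.
Dir.glob("benchmark/**/*.{go,py,ts,rs,rb}").sort.each do |path|
  %w[o1-mini gpt-4o-mini].each do |model|
    puts "== #{model} on #{path}"
    puts review(model, File.read(path))
  end
end
```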
- o1-mini: 11 bugs detected
- 4o-mini: 19 bugs detected
By Language

| Language   | o1-mini | 4o-mini |
| ---------- | ------- | ------- |
| Go         | 2       | 3       |
| Python     | 2       | 4       |
| TypeScript | 1       | 2       |
| Rust       | 2       | 4       |
| Ruby       | 4       | 6       |
In every language, 4o-mini outperformed o1-mini. The gaps weren’t massive—but they were consistent, pointing to a general edge in bug detection.
💡 Interpretation
The difference likely comes down to reasoning.
- o1-mini relies more heavily on pattern recognition. It performs reasonably well when bugs resemble known structures or common mistakes—especially in syntactically predictable languages.
- 4o-mini shows signs of deeper logical reasoning. It performs better in high-context situations where identifying a bug means understanding what the code is supposed to do, not just what it looks like.
That ability to generalize—rather than just memorize—gives 4o-mini an edge, particularly in languages like Ruby or Rust, where logic often deviates from obvious patterns.
🐞 A Bug Worth Highlighting
Test 1: Gain Calculation in Ruby Audio Library
In a TimeStretchProcessor class, the planted bug miscalculated normalize_gain: instead of scaling with stretch_factor, the code used a fixed value, producing audio that was too loud or too quiet depending on playback speed.
- o1-mini missed it
- 4o-mini flagged it
This wasn’t a syntax issue. It was a logic error in audio domain math, requiring the model to reason about the effect of one variable (stretch_factor) on another (gain). 4o-mini connected the dots—o1-mini didn’t.
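Here is a hedged reconstruction of that bug pattern. Apart from TimeStretchProcessor, normalize_gain, and stretch_factor, which are named above, the structure and method bodies are assumptions for illustration, not the benchmark code itself.

```ruby
class TimeStretchProcessor
  attr_reader :stretch_factor

  def initialize(stretch_factor)
    @stretch_factor = stretch_factor
  end

  # Time-stretching spreads the same signal over a different duration, so the
  # output gain needs to be compensated as a function of stretch_factor.
  def normalize_gain
    # BUG: a fixed constant ignores stretch_factor entirely, so the output is
    # too loud or too quiet depending on how much the audio was stretched.
    # Intended behaviour, per the bug description: scale the gain with
    # stretch_factor (e.g. 1.0 / stretch_factor, depending on the library's
    # convention) rather than return a constant.
    1.0
  end

  # The stretching itself is omitted here; only the gain stage is shown.
  def process(samples)
    gain = normalize_gain
    samples.map { |s| s * gain }
  end
end
```

Nothing on the buggy line is syntactically wrong, which is why catching it requires reasoning about what normalize_gain is supposed to depend on.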
✅ Conclusion
Both models are fast, compact, and useful—but 4o-mini is clearly stronger for AI code review.
Its consistent improvement across languages—and its ability to reason through logic and intent—make it a better choice for detecting the kind of bugs that matter in production.
As AI continues to improve, we expect this trend to grow: the best AI reviewers won’t just recognize patterns—they’ll think.
Want to try reasoning-first LLMs on your pull requests?
Check out Greptile — where AI meets real-world engineering.