Large language models are getting better at generating code — but can they debug it?
In this post, we compare two of OpenAI's models, o3-mini and 4o, to see how well they detect subtle bugs in real-world software. Both models are capable, but they take different approaches under the hood: o3-mini is designed for structured reasoning, while 4o is optimized for speed and performance across a wide range of tasks.
🧪 Benchmark Setup
We built a benchmark of 210 small programs, each seeded with a real, hard-to-spot bug. These weren’t toy problems — they were realistic logic errors, edge cases, and misuses of APIs that could easily slip through linters, tests, and even manual review.
We wanted the dataset of bugs to cover multiple domains and languages. I picked sixteen domains, chose two or three self-contained programs for each, and used Cursor to generate each program in TypeScript, Ruby, Python, Go, and Rust.
Next, I cycled through the programs and introduced a tiny bug into each one. Each bug had to be:
- A bug that a professional developer could reasonably introduce
- A bug that could easily slip through linters, tests, and manual code review
Some examples of bugs I introduced:
- An undefined `response` variable in an `ensure` block (sketched below)
- Failing to account for amplitude normalization when computing wave stretching on a sound sample
- A hard-coded date that would be accurate in most, but not all, situations
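To make the first example concrete, here is a minimal Ruby sketch of that bug pattern. The HTTP call, endpoint, and method name are illustrative stand-ins, not the actual benchmark program:

```ruby
require "net/http"
require "json"

# Hypothetical sketch of the ensure-block bug; names and endpoint are invented.
def fetch_user(id)
  response = Net::HTTP.get_response(URI("https://example.com/users/#{id}"))
  JSON.parse(response.body)
ensure
  # BUG: if get_response raises before the assignment completes, `response`
  # is still nil here, so `.code` fails with NoMethodError in the cleanup
  # path instead of logging a status.
  puts "fetch_user status=#{response.code}"
end
```

A linter sees a defined local variable, and the happy path works, so nothing flags the cleanup path that only misbehaves on failure.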
At the end of this, I had 210 programs, each with a small, difficult-to-catch, and realistic bug.
A disclaimer: these bugs are the hardest-to-catch bugs I could think of, and are not representative of the median bug found in everyday software.
📊 Results
Total Bugs Detected (out of 210)
- OpenAI o3-mini: 37
- OpenAI 4o: 20

Breakdown by Language (42 programs per language)

| Language   | o3-mini | 4o |
|------------|---------|----|
| Python     | 7       | 6  |
| TypeScript | 7       | 4  |
| Go         | 7       | 4  |
| Rust       | 9       | 3  |
| Ruby       | 7       | 3  |
The gap is especially wide in Rust and Ruby — languages with less LLM training coverage, where pattern matching often falls short.
🧠 Observations
The data points to a clear trend: o3-mini outperforms 4o across the board, and the difference grows in languages with lower training representation. This likely comes down to how the two models are designed to generate answers.
- o3-mini is part of OpenAI's reasoning-first model family: it takes a planning step before generating its response. This gives it a leg up in situations where logic, structure, or intent need to be inferred, as in bug detection.
- 4o, while powerful, seems more tuned for broad task coverage and speed. Its bug detection suffers in areas that require deep structural understanding.
In high-resource languages like Python and Go, where there’s ample training data, 4o performs respectably. But in domains like Rust and Ruby, o3-mini’s reasoning ability shines through.
🧩 Example: Ruby Audio Bug
Program 33 involved a Ruby audio processing library. In the `TimeStretchProcessor` class, the bug was in the `normalize_gain` calculation: it used a fixed value instead of scaling based on the `stretch_factor`, resulting in audio with inconsistent amplitude.
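Here is a minimal reconstruction of the buggy pattern. The class and method names come from the post; the method bodies and the suggested fix are assumptions for illustration:

```ruby
# Hypothetical reconstruction; the real benchmark program is more complete.
class TimeStretchProcessor
  def initialize(stretch_factor)
    @stretch_factor = stretch_factor
  end

  # BUG: returns a fixed gain regardless of @stretch_factor, so stretched
  # audio comes out with inconsistent amplitude. The fix would scale the
  # gain with the stretch factor (the exact formula here is an assumption),
  # e.g. 1.0 / @stretch_factor.
  def normalize_gain
    0.8
  end

  def process(samples)
    gain = normalize_gain
    samples.map { |sample| sample * gain }
  end
end
```

The code runs, the output sounds plausible at stretch factors near 1.0, and no static tool objects; a reviewer has to reason about the audio domain to notice it.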
- o3-mini caught the issue immediately. It explained that the gain logic didn't respect the time-stretching parameters, leading to mismatched output.
- 4o missed it entirely.
This was a logic error embedded in a domain-specific pattern — not a syntax problem. It’s exactly the kind of bug where reasoning models provide real value.
✅ Conclusion
Both o3-mini and 4o are capable models — but when it comes to AI code review and bug detection, o3-mini has a clear edge. Its structured reasoning lets it catch more complex bugs, especially in languages where data is sparse and logic matters more than patterns.
For engineering teams relying on LLMs to improve code quality, o3-mini is the safer bet — especially for logic-heavy or backend-heavy stacks.
Greptile uses models like o3-mini to review real codebases in production, surfacing bugs before they ever hit prod. Want to see what it finds in your pull requests? Try Greptile — no credit card required.