AI models are getting better at writing code — but can they review it? In this post, we compare two small models from OpenAI: o1-mini and o3-mini, evaluating their ability to catch real-world bugs in code.
Bug detection requires more than just syntax knowledge — it demands logic, context, and inference. Let’s see which model performs better.
🧪 The Evaluation Dataset
I wanted the dataset of bugs to cover multiple domains and languages, so I picked sixteen domains, chose 2–3 self-contained programs for each, and used Cursor to generate each program in TypeScript, Ruby, Python, Go, and Rust.
Next, I went through each program and introduced a single, subtle bug. Each bug had to be:
- A bug that a professional developer could reasonably introduce
- A bug that could easily slip through linters, tests, and manual code review
Some examples of bugs I introduced:
- An undefined `response` variable referenced in an `ensure` block
- Not accounting for amplitude normalization when computing wave stretching on a sound sample
- A hard-coded date that would be accurate in most, but not all, situations (sketched below)
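
To make that last category concrete, here is a minimal Go sketch in the same spirit (the function and scenario are invented for illustration, not taken from the benchmark programs): the hard-coded value is right in most years, so casual review and tests that never hit a leap year both pass.

```go
package main

import (
	"fmt"
	"time"
)

// daysInFebruary is a hypothetical example of the hard-coded-date class of bug:
// 28 is correct for most years, but silently wrong in leap years.
func daysInFebruary(year int) int {
	return 28 // bug: ignores leap years (2024 has 29 days)
}

// daysInFebruaryFixed derives the value from the calendar instead.
func daysInFebruaryFixed(year int) int {
	// The day before March 1st is the last day of February.
	lastDay := time.Date(year, time.March, 1, 0, 0, 0, 0, time.UTC).AddDate(0, 0, -1)
	return lastDay.Day()
}

func main() {
	fmt.Println(daysInFebruary(2024))      // 28, wrong for a leap year
	fmt.Println(daysInFebruaryFixed(2024)) // 29
}
```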
At the end of this, I had 210 programs, each with a small, difficult-to-catch, and realistic bug.
A disclaimer: these bugs are the hardest-to-catch bugs I could think of, and are not representative of the median bugs usually found in everyday software.
📊 Results
Overall Bugs Caught (out of 210 programs)
- OpenAI o1-mini: 11 bugs
- OpenAI o3-mini: 37 bugs
That’s more than a 3× improvement from o1-mini to o3-mini, a significant step forward.
By Language
- Python: o1-mini 2, o3-mini 7
- TypeScript: o1-mini 1, o3-mini 7
- Go: o1-mini 2, o3-mini 7
- Rust: o1-mini 2, o3-mini 9
- Ruby: o1-mini 4, o3-mini 7
o3-mini outperformed o1-mini across every language, often by a wide margin.
🧠 Interpretation
The performance gap reflects more than just scale — it highlights an architectural shift.
- o1-mini is lightweight, likely optimized for speed and simple pattern recognition.
- o3-mini appears to integrate structured reasoning, allowing it to follow logic chains, infer intent, and detect subtle issues in concurrency and flow.
That’s why o3-mini shines in languages like Rust and Go — where reasoning about ownership, memory, or parallelism is key. It also performs better in TypeScript and Python, likely due to improved training data coverage and planning capability.
🧩 A Bug Worth Highlighting
Go — Race Condition in NotifyDeviceUpdate
In a smart home backend, device state updates were broadcast without proper synchronization, resulting in race conditions (a simplified sketch of the pattern follows below).
- o1-mini missed the issue
- o3-mini correctly identified the lack of locking
o3-mini’s output:
"There is no locking around device updates before broadcasting, which could lead to race conditions where clients receive stale or partially updated device state."
This wasn’t a syntax error — it was a concurrency flaw that required simulating how code might behave under load. That’s where reasoning-first models thrive.
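
For concreteness, here is a minimal Go sketch of that pattern (the names and structure are invented for illustration, not taken from the actual benchmark program): the buggy method reads and mutates shared device state and broadcasts it without taking the hub's mutex, while the fixed method holds the lock across the whole read-modify-broadcast sequence.

```go
package main

import (
	"fmt"
	"sync"
)

// Device holds the state of a smart home device.
type Device struct {
	ID    string
	State string
}

// Hub fans device updates out to connected clients.
type Hub struct {
	mu      sync.Mutex
	devices map[string]*Device
	clients []chan Device
}

// NotifyDeviceUpdateBuggy mutates shared state and broadcasts it without
// holding the lock, so concurrent callers can interleave and clients may
// receive stale or partially updated device state.
func (h *Hub) NotifyDeviceUpdateBuggy(id, state string) {
	d := h.devices[id] // unsynchronized read of the shared map
	d.State = state    // unsynchronized write of the shared struct
	for _, c := range h.clients {
		c <- *d // another goroutine may change d between iterations
	}
}

// NotifyDeviceUpdate holds the mutex across the read-modify-broadcast
// sequence and sends every client the same consistent snapshot.
func (h *Hub) NotifyDeviceUpdate(id, state string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	d := h.devices[id]
	d.State = state
	snapshot := *d
	for _, c := range h.clients {
		c <- snapshot
	}
}

func main() {
	hub := &Hub{
		devices: map[string]*Device{"lamp": {ID: "lamp", State: "off"}},
		clients: []chan Device{make(chan Device, 1)},
	}
	hub.NotifyDeviceUpdate("lamp", "on")
	fmt.Println((<-hub.clients[0]).State) // on
}
```

Go's race detector can flag the buggy version, but only when a test actually exercises concurrent calls; a reviewer reading the code has to reason it out, which is what makes this class of bug a good test of model reasoning.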
✅ Final Thoughts
This benchmark makes the case clear: o3-mini is significantly better than o1-mini at detecting subtle software bugs. It handles logical reasoning, concurrency, and intent much more effectively.
- Use o3-mini if you want a compact model that still offers strong reasoning.
- Use o1-mini if you need a small, fast model and are OK with lower bug coverage.
Greptile uses models like o3-mini in production to catch bugs in pull requests — before they reach production. Want to try it on your repo? Try Greptile — no credit card required.