As AI continues to expand its reach in software engineering, a key frontier is bug detection. Generating code is one thing—but catching real-world bugs requires deeper reasoning, context awareness, and logic tracing.
This post compares two of OpenAI's reasoning models, o1 and the newer o3-mini, to see how well each detects complex bugs in real codebases.
🧪 The Evaluation Setup
I designed a benchmark of 210 buggy programs spanning 16 real-world domains. Each program contained one subtle, realistic bug. Languages tested:
- Python
- TypeScript
- Go
- Rust
- Ruby
🧪 The Evaluation Dataset
I wanted the dataset to cover multiple domains and languages, so I picked sixteen domains, chose 2-3 self-contained programs for each, and used Cursor to generate each program in TypeScript, Ruby, Python, Go, and Rust.
Next, I cycled through the programs and introduced a tiny bug in each one. Every bug I introduced had to be:
- A bug that a professional developer could reasonably introduce
- A bug that could easily slip through linters, tests, and manual code review
Some examples of bugs I introduced:
- Undefined `response` variable in an `ensure` block
- Not accounting for amplitude normalization when computing wave stretching on a sound sample
- A hard-coded date that would be accurate in most, but not all, situations (see the sketch after this list)
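To make the last category concrete, here is a minimal Python sketch of that kind of bug. It is an illustrative example, not one of the actual benchmark programs:

```python
from datetime import date

# Days per month, hard-coded.
DAYS_IN_MONTH = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]

def last_day_of_month(d: date) -> date:
    """Return the last calendar day of d's month."""
    # Bug: February is hard-coded to 28 days, which is accurate in most
    # years but wrong in leap years (e.g. 2024-02-29 exists).
    return date(d.year, d.month, DAYS_IN_MONTH[d.month - 1])
```

Nothing here trips a linter, and a test suite that never checks a leap-year February passes cleanly, which is exactly the kind of failure mode these bugs target.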
At the end of this, I had 210 programs, each with a small, difficult-to-catch, and realistic bug.
A disclaimer: these are the hardest-to-catch bugs I could think of, and they are not representative of the median bug found in everyday software.
📊 Results
Total Bugs Caught (out of 210)
- OpenAI o3-mini: 37
- OpenAI o1: 15
Detection by Language

| Language   | o1 | o3-mini |
|------------|----|---------|
| Python     | 2  | 7       |
| TypeScript | 4  | 7       |
| Go         | 2  | 7       |
| Rust       | 3  | 9       |
| Ruby       | 4  | 7       |
Across all five languages, o3-mini outperformed o1—sometimes by a wide margin.
💡 Why o3-mini Wins
The key difference? Reasoning.
- o3-mini includes a structured planning step before generating responses. This helps it trace logic, evaluate intent, and surface bugs that go beyond syntax or pattern matching.
- o1, while capable, relies more heavily on memorized patterns. It holds up best in TypeScript and Ruby, where pattern-based bugs are common, but struggles when deeper logical deduction is required.
The gap is widest in languages like Rust and Go, where o3-mini's structured reasoning helps it generalize even with sparser training data.
🐞 A Bug Worth Highlighting
Test #12 — Python: AttributeError in CSV Handling
In this test, a function tried to determine the number of CSV rows using `rows.length` instead of Python's `len(rows)`, a subtle but critical mistake.
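Here is a minimal sketch of that pattern. The function name `_load_csv_dataset` comes from o3-mini's output quoted below; the surrounding details are illustrative rather than the actual benchmark program:

```python
import csv

def _load_csv_dataset(path: str, limit: int = 100) -> list[list[str]]:
    """Load up to `limit` rows from a CSV file."""
    with open(path, newline="") as f:
        rows = list(csv.reader(f))

    # Bug: `rows` is a Python list, and lists have no `.length` attribute.
    # This raises AttributeError at runtime; the correct expression is len(rows).
    end_index = min(rows.length, limit)
    return rows[:end_index]
```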
- o3-mini caught the bug
- o1 missed it
o3-mini’s output:
"The code incorrectly uses rows.length instead of Python's len(rows) in _load_csv_dataset, which will raise an AttributeError when trying to determine the end index for slicing."
This is the kind of bug that requires understanding the language's semantics and tracing how the value is actually used, not just pattern matching on familiar syntax. o3-mini recognized the language-specific error and explained its runtime consequences, something o1 failed to do.
✅ Final Thoughts
This benchmark shows a clear winner: OpenAI o3-mini is significantly more effective at detecting real-world software bugs.
- Use o3-mini for reasoning-heavy reviews, especially in languages with tricky semantics or lower training coverage.
- Use o1 for lightweight validation or pattern-heavy codebases.
As more AI tools aim to assist with real code reviews, structured reasoning will be the next big differentiator.
Greptile uses models like o3-mini to catch real bugs in PRs—logic flaws, concurrency issues, edge case crashes, and more. Curious what it would find in your codebase? Try Greptile — no credit card required.