I’m Daksh, co-founder of Greptile, an AI developer tools company. Our most popular product is our code review bot, and so I spend a lot of time talking to software engineers to better understand what pain points exist in the code review process and what makes the perfect human code reviewer.
Co-authoring this article with me is Dan Goslen, a senior engineer at Vouch, an insurtech company that is now perhaps the default choice for new companies seeking commercial insurance. Dan is also the author of "Code Review Champion" and has spent years studying the code review process.
While rule-based code reviewers have been around for some time, AI code reviews are much newer, and many believe them to be not only a step-function improvement over predecessors like SonarQube but also, to varying extents, a valid replacement for human code reviewers. In this blog post we want to understand what the code review process is, and whether it will ever be automated away in the way many AI companies claim it will.
An introduction to code reviews
Considering how central code reviews are to professional software development, they are curiously neglected in university CS programs - rarely covered even in passing as part of another course. The result is that most new engineers are never taught the purpose of code reviews and therefore don't know how to do them well.
In our view, code reviews have a few purposes:
- Quality control: logic, patterns, style, etc.
- Skill growth: provide feedback to coworkers (especially junior folks), making your team stronger with every review.
To that end, good code reviews are:
- Conversational: technical decisions can be debated between the author and reviewer, generally yielding better decisions assuming both are competent and unreservedly opinionated.
- Clairvoyant: with enough experience, some engineers can predict that a change will become painful 6 months down the line in a way that isn’t yet clear to most people.
- Consistently thorough: 1-line changes and 100-file rewrites are both reviewed with equally high attention to detail.
- Constructive: “No, because…” instead of “No” makes every in-line comment an opportunity for learning.
Rule-based vs. AI vs. human
1. Rule-based
Rule-based review bots or tools have been around for a while. They range from general-purpose static analysis tools like SonarQube, to language-specific linters like RuboCop or ESLint, to security-focused tools like Semgrep or Endor.
These tools offer great insights into your code, but their capabilities and scope are limited. They are designed to match patterns that often indicate bugs or pitfalls in the code being reviewed, and they are very good at doing that.
What they are not designed to do, however, is to recognize areas of improvement or even recognize well-written code. They simply say, “I spot this pattern,” or they don’t.
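To make that concrete, here is a small illustrative snippet (not from any real codebase) showing the kind of pattern such a tool matches - in this case the loose-equality pattern that ESLint's built-in eqeqeq rule flags:

```typescript
// A pattern-matching rule like ESLint's "eqeqeq" flags the loose equality
// below and suggests "===". It says nothing about whether string-based role
// checks are a good design for authorization - that question is outside its scope.
function isAdmin(user: { role?: string }): boolean {
  return user.role == "admin"; // flagged: "==" performs type coercion
}

// The rule-compliant version. The tool is satisfied; whether the overall
// approach is well-written code is a question it never answers.
function isAdminStrict(user: { role?: string }): boolean {
  return user.role === "admin";
}
```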
Weaknesses
- One-directional; there is no ability to converse with the tool to understand its reasoning or convince it that a rule doesn't apply in the current context
- Can produce many false positives, which can lead to disabling rules over time (making the tool less and less useful)
- Often do not provide adequate context as to why an issue needs addressing
Strengths
- Very thorough and fast. They will review all code being changed in a review (or all code, depending on the configuration), whether the change is 1 line or 1,000 lines
- Deterministic. If a rule matches, it matches, and the same output is always generated
- The rules are battle-tested and, therefore, hard to “trick” to get buggy code merged. Rule violations must be addressed for code to be merged
2. AI Code Reviewers
At their core, AI code review systems involve feeding an LLM the contents of the diff, metadata from the PR, and a lot of carefully chosen additional context, which can include modules potentially affected by the change, previous commits, linked issues, and so on. Out comes a summary of the PR and, with some prompting, in-line comments on specific changes too.
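As a rough illustration, a minimal sketch of that flow might look like the following. The names and shapes here (gatherContext, callLLM, and so on) are hypothetical stand-ins, not any particular vendor's API:

```typescript
// Hypothetical sketch of an AI review pipeline. gatherContext and callLLM
// are stand-ins for real context-building and LLM-provider code.

interface PullRequest {
  title: string;
  description: string;
  diff: string;           // unified diff of the change
  linkedIssues: string[]; // text of issues referenced by the PR
}

// Stand-in: collect extra context such as affected modules and recent commits.
async function gatherContext(pr: PullRequest): Promise<string> {
  return `Files touched: ${pr.diff.split("\n+++ ").length - 1}`;
}

// Stand-in: call whatever LLM provider the reviewer is built on.
async function callLLM(prompt: string): Promise<string> {
  return `(model output for a ${prompt.length}-character prompt)`;
}

async function reviewPullRequest(pr: PullRequest): Promise<string> {
  const context = await gatherContext(pr);

  // The core of the system: diff + PR metadata + curated context in, review out.
  const prompt = [
    "You are a code reviewer. Summarize this pull request and leave",
    "in-line comments on any changes that look incorrect or risky.",
    `Title: ${pr.title}`,
    `Description: ${pr.description}`,
    `Linked issues:\n${pr.linkedIssues.join("\n")}`,
    `Additional context:\n${context}`,
    `Diff:\n${pr.diff}`,
  ].join("\n\n");

  return callLLM(prompt);
}
```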
A few strengths and weaknesses immediately emerge:
Weaknesses:
- LLMs are non-deterministic, so they can make mistakes and should not be trusted to be exhaustive.
- They are only partially observable and therefore only partially controllable; prompting can help make them follow certain rules, but it is far less reliable than good old-fashioned if statements.
Strengths:
- They arguably understand what they are seeing, so they have far fewer false positives than rule-based systems.
- With sufficiently powerful contextualization algorithms, they can reason about the ramifications of a change across the codebase.
- They are partially clairvoyant: because they have some understanding of developer intent, they can often (but not always) identify issues that require foresight to detect.
3. Human Code Reviewers
Regardless of any tools you use to review code, most code will be reviewed by at least one other human besides the author. As you well know, humans are very different from software. They are non-deterministic and not always reliable, but the best human reviewers can bring insight and expertise to nearly any code they come across.
Even intelligent AI code reviewers don’t sit in on meetings, understand the directional changes in the real world that the software operates in, or know the strengths and weaknesses of the org that maintains the codebase. You could give them the entire commit history of your repo and they still wouldn’t have the foresight that a human engineer has.
Weaknesses
- Cannot provide consistently thorough reviews; prone to miss both big and small issues
- Review time can be long, leading to long cycle times for getting code approved and merged
- Interpersonal communication styles and/or existing conflicts with teammates can come into play
Strengths
- Great code reviewers bring incredible insight and perspective to the review. They have years of experience and external context beyond the code itself to guide the code towards better long-term outcomes.
- Great code reviewers don't just find problems or bugs but celebrate well-written code.
- Can engage in dialogue with the author to share their knowledge, expertise, and perspective, which, when done well, leads to better code outcomes
How should I be running code reviews?
Technically, the most comprehensive way to do a code review is to use all three. The rule-based reviewer catches 100% of its pre-defined patterns, the AI reviewer flags higher-order errors, and the human reviewer provides clairvoyance, prescribing changes that are valid only in the context of the wider plans the team has for their codebase.
In practice, this is quite infeasible. It is useful, therefore, to find an 80/20 here. Opinions might differ, and, critically, the specific nature of the codebase also affects which strategy is optimal.
With that said, this is the 80/20 solution that we arrived at:
Remove rule-based reviewers
Rule-based tools are the noisiest of the three due to the high incidence of false positives. It follows that they create the most work for the author and reviewer. Additionally, you could just have developers run linters locally before they ever open a PR, achieving the same effect.
Run an AI reviewer
Surprise, surprise - the AI code review company thinks you should use AI code reviewers! I understand the skepticism - and all I ask is that you consider that we don’t believe in AI code review bots because we sell them, but rather that we sell them because we believe in them.
AI reviews catch basic anti-patterns and some higher-order ones, including those humans might miss. Resolve the comments it leaves, then pass the PR on to a co-worker to review.
Have a co-worker review
Ideally, humans are exclusively doing tasks that only humans can do. Identifying that a FileRead was not wrapped in a try-catch block is arguably a waste of an engineer's time (perhaps the only good that comes of it is the opportunity for mentorship). Recommending caching tokens to reduce the number of authentication requests is a far more complex endeavor. A good engineer also uses their best judgment, based on a variety of factors, to decide whether they should comment at all. Would users be okay with this? Does the gain in performance make a difference for this particular type of user and the way they use this software? This type of context, which one might call “tribal knowledge”, can be applied only by humans.
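To make the first category concrete, here is a hypothetical example of the kind of mechanical issue that doesn't need a human's judgment - the unhandled file read mentioned above:

```typescript
import { readFileSync } from "node:fs";

// The mechanical case: any reviewer - rule-based, AI, or human - can spot
// that this read can throw and isn't handled. Spending a human's attention
// here is mostly a mentorship opportunity, not a use of their judgment.
function loadConfig(path: string): string {
  return readFileSync(path, "utf-8"); // throws if the file is missing
}

// The handled version - the kind of fix an automated reviewer can suggest.
function loadConfigSafely(path: string): string | undefined {
  try {
    return readFileSync(path, "utf-8");
  } catch {
    return undefined; // the caller decides how to handle a missing config
  }
}
```

The token-caching recommendation, by contrast, can't be judged from the diff alone; it depends on who the users are and how they use the software.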
Then there are the organizational benefits of human code reviews that cannot be emulated. Code review is one of the few collaborative steps in the software development life cycle: an opportunity for discussion around design, mentorship, and general team building.
Even with all-powerful AI code reviewers that might someday close the “tribal-knowledge” gap, the experience of collaborating on code before it is merged is intrinsically invaluable for software teams, and is unlikely to be truly replaced.