[Illustration: AI reduced the cost of writing code, but human judgment in reviewing it doesn't scale]
Marina Rivosecchi · Feb 21, 2026
AI Code Review · Code Quality · Tenki

AI Made Writing Code Cheap. Judgment Didn't Scale.


TL;DR

AI writes code faster than ever, but review quality is falling behind. The gap between velocity and judgment is the real risk most teams haven't named yet.

Here's the productivity story everyone tells: AI coding tools made your team faster. PRs per developer are up. Cycle times are down. Velocity charts look great in the quarterly review.

Here's the story nobody tells: your senior engineers are drowning in review queues, rubber-stamping diffs they don't have time to read, and subtle bugs are reaching production because the one part of the process that requires human judgment didn't get any faster.

AI made writing code cheap. It didn't make judgment cheap. And the gap between those two is where real engineering risk now lives.

The Numbers Are Lopsided

AI coding assistants now account for roughly 41-46% of all new code, according to multiple industry reports from late 2025 and early 2026. GitHub reported a 29% year-over-year increase in merged pull requests. Salesforce's engineering team saw code volume jump 30%, with PRs regularly expanding beyond 20 files and 1,000 lines of change.

Meanwhile, review is going the other direction. Faros AI analyzed data from over 10,000 developers across 1,255 teams and found that teams with high AI adoption completed 21% more tasks and merged 98% more PRs, but PR review time increased 91%. CodeRabbit's analysis found that AI-generated PRs contain 1.7x more issues per pull request than human-written code.

The output doubled, the defect rate went up, and the review bottleneck got worse. That's not a productivity gain. That's quality debt accumulating in the one place teams can least afford it.

What Judgment Actually Means in Code Review

Most conversations about AI code review treat it as a pattern-matching problem. Spot the unused import. Flag the missing null check. Suggest a more idiomatic function name. Linters have done this for years. LLMs do it slightly better.

But the things that make a senior engineer's review valuable have almost nothing to do with pattern matching. Judgment in code review means asking questions like:

  • This PR touches the payment flow. We had an incident here six months ago. Does this change reintroduce that risk?
  • This refactor looks clean in isolation, but it breaks an assumption three other services rely on.
  • This new dependency has a permissive license. Legal hasn't approved it, and it pulls in transitive deps we don't control.
  • The code is correct, but the approach contradicts a decision we made deliberately two quarters ago. Reopening that decision has costs the PR author probably doesn't see.

None of these are syntax problems. They require context: the history of the codebase, the architecture of the system, the organizational decisions that shaped it, and the consequences of getting it wrong. That's what judgment means. And it doesn't scale when the only thing accelerating is output.

Why Most AI Reviewers Can't Do This

The first generation of AI code review tools works like this: receive a diff, send it to an LLM, post the comments. Some are more polished than others. Some let you configure rules. But structurally, they share the same limitation.

They don't know your codebase.

An LLM reviewing a diff in isolation is like asking a contractor to evaluate plumbing changes without showing them the floor plan. They can tell you whether a pipe joint looks wrong. They can't tell you it's going to flood the basement because the main shutoff valve is on the other side of a wall they've never seen.

Salesforce's engineering team described this precisely in their January 2026 writeup. They found that the file-by-file review model broke down under AI-generated load because reviewers couldn't reconstruct the intent behind changes that spanned backend logic, configuration, tests, and UI components simultaneously. The changes didn't preserve what they called "conceptual coherence." Reviewers had to infer purpose from disconnected fragments.

Most AI review tools don't even try to solve this. They operate on the diff, not the codebase. They have no memory of past PRs, no awareness of recent incidents, no model of your architecture. They produce comments that are technically plausible but contextually hollow.

The volume problem makes this worse, not better. As Ankit Jain wrote on Latent Space in March 2026: "Teams produce more code, then spend more time reviewing it. There is no way we win this fight with manual code reviews." He's right that the old model is breaking. But replacing human review with AI review that has the same blind spots just moves the problem.

The Three Things a Reviewer Needs That Most Tools Skip

If you break down what makes a good review comment, three ingredients come up repeatedly.

Codebase awareness. The reviewer needs to understand how the changed files relate to the rest of the system. Not just the diff. The call graph, the data flow, the modules that depend on what's changing. Without this, you can't spot the PR that looks harmless in isolation but breaks an invariant three layers up.
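To make "codebase awareness" concrete, here is a toy sketch of the kind of question it answers: given a dependency graph, what else could a change to one module break? The module names and graph are invented for illustration; a real system would derive this from the call graph and imports.

```python
# Toy reverse-dependency map: answers "what could this change break?"
# Module names and edges are hypothetical.
DEPENDS_ON = {
    "billing": ["payments", "auth"],
    "checkout": ["payments", "inventory"],
    "reports": ["billing"],
}

def reverse_deps(graph):
    """Invert the graph: module -> set of modules that rely on it."""
    rev = {}
    for mod, deps in graph.items():
        for dep in deps:
            rev.setdefault(dep, set()).add(mod)
    return rev

def blast_radius(module, rev, seen=None):
    """Everything transitively affected by changing `module`."""
    seen = seen if seen is not None else set()
    for dependent in rev.get(module, ()):
        if dependent not in seen:
            seen.add(dependent)
            blast_radius(dependent, rev, seen)
    return seen

rev = reverse_deps(DEPENDS_ON)
print(sorted(blast_radius("payments", rev)))  # ['billing', 'checkout', 'reports']
```

A PR that touches only `payments` looks small in the diff view; the reverse map shows it reaches three other modules, including one two hops away.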

Risk calibration. Not all code changes carry the same risk. A formatting tweak to a README doesn't need the same scrutiny as a change to the auth middleware. Good reviewers triage instinctively. They spend thirty seconds on the safe stuff and thirty minutes on the dangerous stuff. Most AI reviewers comment uniformly across the entire diff, which trains developers to ignore all of them.
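The triage instinct can be sketched as a path-based risk heuristic. The patterns, weights, and tier names below are assumptions for illustration, not how any particular tool scores risk.

```python
# Hypothetical risk-calibration heuristic: score a PR by what it touches.
# Patterns, weights, and tier names are illustrative only.
from fnmatch import fnmatch

RISK_WEIGHTS = {
    "src/auth/*": 10,      # auth middleware: always deep scrutiny
    "src/payments/*": 10,  # incident history lives here
    "src/api/*": 5,
    "tests/*": 2,
    "docs/*": 1,
    "*.md": 1,
}

def risk_score(changed_files):
    """Return the highest weight matched by any changed file."""
    score = 0
    for path in changed_files:
        for pattern, weight in RISK_WEIGHTS.items():
            if fnmatch(path, pattern):
                score = max(score, weight)
    return score

def review_tier(changed_files):
    s = risk_score(changed_files)
    if s >= 10:
        return "deep-human-review"
    if s >= 5:
        return "focused-summary"
    return "auto-approve-eligible"

print(review_tier(["docs/README.md"]))          # auto-approve-eligible
print(review_tier(["src/payments/charge.py"]))  # deep-human-review
```

Even a heuristic this crude encodes the thirty-seconds-versus-thirty-minutes split; the point is that the scrutiny budget varies with the change, not with the line count.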

Signal discipline. The best reviewers say less, not more. They don't comment on every line. They identify the one or two things that actually matter and explain why. A review tool that posts fifteen nitpicks and buries the one real bug is actively making things worse. Developers start skipping the comments entirely, and the real issue slips through.
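Signal discipline is essentially a severity filter applied before anything gets posted. A minimal sketch, with severity labels and the threshold chosen as assumptions:

```python
# Illustrative signal filter: keep only comments that merit human attention.
# Severity labels and the default threshold are assumptions for this sketch.
from dataclasses import dataclass

@dataclass
class ReviewComment:
    path: str
    line: int
    severity: str  # "critical" | "high" | "medium" | "nit"
    message: str

SEVERITY_RANK = {"critical": 3, "high": 2, "medium": 1, "nit": 0}

def filter_comments(comments, threshold="high"):
    """Drop everything below the threshold so real issues aren't buried."""
    floor = SEVERITY_RANK[threshold]
    kept = [c for c in comments if SEVERITY_RANK[c.severity] >= floor]
    # Most severe issue first: it should be the first thing a reviewer sees.
    return sorted(kept, key=lambda c: -SEVERITY_RANK[c.severity])

comments = [
    ReviewComment("app.py", 12, "nit", "Prefer f-strings here"),
    ReviewComment("auth.py", 88, "critical", "Token check bypassed on retry"),
    ReviewComment("app.py", 40, "medium", "Consider extracting a helper"),
]
for c in filter_comments(comments):
    print(f"{c.severity}: {c.path}:{c.line} {c.message}")
# Only the critical token-check issue survives the filter.
```

Fifteen nitpicks become one critical finding, which is the difference between a comment that gets read and a bot that gets muted.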

These three things separate a code review tool from a code review system. They're exactly what gets lost when you scale review by throwing another LLM at the diff.

How Tenki Approaches the Judgment Problem

Tenki's Code Reviewer was built around a different premise: review comments are only useful if they're informed by the full codebase, not just the diff.

When you connect a repository, Tenki indexes the entire codebase: file relationships, architectural patterns, dependencies between modules. It doesn't start from zero on every PR. It starts from a model of your system and evaluates the change against that model.

That's the difference between asking "does this code look correct?" and asking "does this change make sense given everything else in this repository?" The first question catches syntax bugs. The second catches the kind of systemic issues that actually cause incidents.

Tenki also accepts custom context: team-specific guidelines, architectural decisions, areas of the codebase flagged as high-risk. This isn't just rule configuration. It's giving the reviewer the kind of institutional knowledge that a new hire wouldn't have but your staff engineer does.

Crucially, Tenki filters for severity. It doesn't dump a wall of comments on every PR. It raises critical and high-severity issues, the ones that actually need human attention, and stays quiet on the rest. That's signal discipline. It's the difference between a reviewer you trust and one you mute.

What Good Review Looks Like at Scale

The goal isn't to remove humans from code review. It's to make sure the humans who are reviewing spend their time on the things that actually require human judgment.

A team running well with AI-augmented review looks something like this:

  • Low-risk PRs like dependency bumps, formatting, and documentation get reviewed automatically with high confidence. Humans approve without deep reading.
  • Medium-risk PRs get a focused summary of what changed architecturally, with specific areas flagged for human attention. The reviewer starts from context instead of building it from scratch.
  • High-risk PRs touching security-sensitive code, cross-service boundaries, or areas with incident history get elevated. The AI flags exactly why, with file- and line-level references. Humans do the deep review, but they know where to look.

Salesforce reached a similar conclusion when they rebuilt their internal review system. As they put it: "The response was not to automate judgment. Instead, it was to rebuild review as a system aligned with how developers actually reason about change." The human stays in the loop, but the loop is designed so their attention goes where it matters.

The Risk You're Not Measuring

Most engineering orgs track velocity. Fewer track review depth. Almost none track the gap between the two.

That gap is where Atomic Robot's February 2026 analysis hit a nerve. They compared what's happening in software to automation complacency in aviation and radiology: "AI moved the hard part from writing to reviewing. The vigilance failures plaguing pilots and radiologists now hit code review." When the system generates most of the output and humans are supposed to catch errors, but the volume exceeds what humans can meaningfully process, you get exactly the kind of failure that's hardest to detect until something breaks in production.

Stack Overflow's 2025 developer survey captured the same tension from the practitioner side: more than 84% of developers now use or plan to use AI tools, but trust didn't follow. Developers use the tools and don't trust the output. The gap between usage and trust is precisely where review quality erodes.

If you're a CTO looking at DORA metrics and celebrating faster lead times, ask yourself: has review time per line of code changed? Are your senior engineers spending more or less time per PR? Are incident rates for AI-assisted changes tracking differently from human-written ones? If you don't know the answers, the judgment gap is growing and you can't see it.
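The first of those questions is cheap to answer if you log review times. A minimal sketch of review minutes per changed line, tracked across quarters, using hypothetical PR records:

```python
# Minimal sketch of the metric most dashboards skip: review minutes per
# changed line, tracked over time. The PR records below are hypothetical.
prs = [
    # (review_minutes, lines_changed, quarter)
    (30, 120, "Q3"), (45, 300, "Q3"), (20, 80, "Q3"),
    (25, 400, "Q4"), (35, 900, "Q4"), (15, 350, "Q4"),
]

def minutes_per_line(records, quarter):
    """Total review minutes divided by total changed lines for a quarter."""
    rows = [(m, n) for m, n, q in records if q == quarter]
    return sum(m for m, _ in rows) / sum(n for _, n in rows)

for q in ("Q3", "Q4"):
    print(f"{q}: {minutes_per_line(prs, q):.3f} review min/line")
# If the ratio falls while PR volume rises, review depth is eroding
# even though every velocity chart looks healthy.
```

In this made-up data, review attention per line drops roughly fourfold between quarters while merged volume triples. That is the judgment gap, made visible.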

Code Got Cheap. Make Sure Judgment Doesn't Get Ignored.

The industry is at an inflection point. AI made code generation cheap and fast, but the review processes that keep code safe were built for a world where humans wrote at human speed. That world is gone.

The next generation of review tools won't succeed by generating more comments. They'll succeed by understanding your codebase deeply enough to generate fewer, better ones. The ones that surface the risks your team would have caught if they had unlimited time and perfect memory.

That's what scaling judgment actually looks like. Not automating away the human. Giving the human what they need to be effective at the speed the code now moves.

Tenki Code Reviewer reads your full codebase, reviews every PR for bugs and security issues, and posts comments directly in GitHub. It starts working on your next pull request.
