
Claude Code Source Leak: What It Reveals About AI Agent Trust

AI agents write syntactically clean code that can be semantically wrong. Here's a practical checklist for catching the failure modes humans miss in agent-authored PRs.
Your team just adopted an AI coding agent. PRs are flying in faster than ever. The code looks clean, the tests pass, and the diffs are well-organized. Three weeks later, you're debugging a production incident caused by a function that calls an API endpoint that doesn't exist.
This is the core problem with reviewing agent-generated code. It doesn't fail the way human code fails. Human code has typos, inconsistent style, and obvious logical gaps. Agent code is syntactically polished but can be confidently, subtly wrong in ways that slip past standard review heuristics.
I've been reviewing PRs from Copilot, Cursor, Claude, and Codex across several codebases over the past year. What follows is the checklist I actually use, organized around the specific failure modes I've seen agents produce repeatedly.
When you review a human's PR, you rely on a set of heuristics built over years. Messy formatting signals rushed work. Unusual variable names signal inexperience. Large diffs with no tests signal cutting corners. These heuristics work because human code quality correlates with surface-level signals.
Agent code breaks that correlation. An LLM will produce beautifully formatted, well-commented code with descriptive variable names and comprehensive-looking tests, and the underlying logic can still be wrong. The surface quality is always high because the model optimizes for patterns it's seen in high-quality training data. But pattern matching isn't understanding.
As Graphite's analysis of AI in code review puts it well: AI excels at automating low-level, objective checks, but humans remain essential for the high-level, subjective tasks that require business context and architectural judgment. The problem is sharper when the code itself was written by AI. You're not just reviewing code; you're reviewing the output of a system that's very good at looking correct.
You need a different mental model. Instead of scanning for sloppiness, you're scanning for plausible-looking wrongness.
Each of the seven failure modes below is a pattern I've seen in agent-generated PRs across multiple tools and codebases. They aren't theoretical; they're the ones that made it to production.
1. Hallucinated APIs

This is the most common agent failure mode and the easiest to miss. The agent generates a call to a method that doesn't exist on the object, imports a module from a package that was renamed two versions ago, or references an API endpoint that was in the training data but never existed in your codebase.
What to check: Verify every import and every method call against the actual dependency version in your lockfile. Don't trust that foo.bar() exists because it looks reasonable. Jump to the definition. If the agent added a new dependency, check that the specific version it's pulling actually exports the functions being used.
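A quick way to do that verification without leaving the terminal is to list what the installed package actually exports. The snippet below is a sketch that uses Node's built-in path module as a stand-in; point it at whatever package the agent imported, as resolved from your lockfile.

```typescript
// List a dependency's actual exports before trusting an agent-written import.
// 'node:path' is a stand-in here; inspect the package the agent imported,
// at the version your lockfile pins, not the docs for the latest release.
import * as path from 'node:path';

const exported = Object.keys(path);
console.log(exported.includes('join'));       // a real export
console.log(exported.includes('joinPaths'));  // plausible-sounding, but fake
```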
```typescript
// Agent wrote this; it looks perfectly reasonable
import { validateSchema } from '@openapi/validator';
// Problem: @openapi/validator doesn't export validateSchema.
// The actual export is validateOpenAPISchema.
// The agent mixed up the API from an older version in its training data.
```

2. Tautological tests

This one is insidious. The agent writes tests. The tests pass. Coverage goes up. Everyone feels good. But the tests don't actually verify behavior; they verify that the code does what the code does.
I've seen agents produce test suites where every assertion essentially mirrors the implementation. The test calls the function, gets the result, and asserts that the result equals... the result of calling the function. Or the mock is set up to return exactly what the assertion expects, so the test is really just testing the mock.
What to check: For every test, ask: "If I introduced a bug in the implementation, would this test catch it?" If the answer is no, the test is decorative. Look specifically for mocks that return hardcoded values matching the assertion, and for tests where the expected value is computed by calling the same code path being tested.
```typescript
// Tautological test: the mock determines the outcome
it('should calculate the discount', () => {
  const mockPricing = { getDiscount: jest.fn().mockReturnValue(0.15) };
  const result = applyDiscount(100, mockPricing);
  expect(result).toBe(85); // Only passes because the mock returns 0.15
  // If getDiscount had a bug returning 0.50, this test wouldn't know
});
```

3. Pattern mismatch

Agents love patterns. They've seen millions of repositories, and they'll happily apply patterns from one context to a completely different one. You'll see a Redux-style state management pattern show up in a codebase that uses Zustand. You'll see Express middleware conventions in a Fastify project. The code compiles and runs, but it fights the existing architecture.
What to check: Compare the patterns in the agent's PR to your existing codebase conventions. Does the new code follow the same error-handling strategy? Does it use the same data-fetching approach? If the agent introduced a new pattern, is there a good reason, or did it just default to whatever was most common in its training data?
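To make the mismatch concrete, here's a minimal sketch (all names hypothetical): the codebase's convention is Result-style error values, but the agent defaults to throwing, a pattern every existing caller will handle incorrectly.

```typescript
// Existing convention in this hypothetical codebase: errors are values.
type Result<T> = { ok: true; value: T } | { ok: false; error: string };

function findUser(id: number): Result<string> {
  return id > 0
    ? { ok: true, value: `user-${id}` }
    : { ok: false, error: 'not found' };
}

// Agent-written addition: compiles, runs, and fights the architecture.
// Callers written against the Result convention will never catch this.
function findTeam(id: number): string {
  if (id <= 0) throw new Error('not found');
  return `team-${id}`;
}
```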
4. Over-abstraction

Ask an agent to add a feature and it'll sometimes give you a factory-pattern-wrapped, strategy-interfaced, dependency-injected architecture astronaut's dream. You asked for a function that sends an email; you got an abstract notification system with pluggable transport layers.
This happens because agents have been trained on enterprise codebases full of these abstractions, and they associate them with "good code." But premature abstraction makes code harder to understand, harder to debug, and harder to change. The irony is that the code is technically correct; it's just solving a problem you don't have.
What to check: Count the number of files and interfaces the agent created relative to the feature's complexity. If there are more abstractions than concrete implementations, push back. Apply the rule of three: don't abstract until you have three concrete cases, not one case and two hypothetical future ones.
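As an illustration (names hypothetical), here's the shape of the problem: an agent's pluggable-transport design next to the function the feature actually needed. Both produce the same result; one adds three extra concepts to maintain.

```typescript
// What the agent produced: an abstraction with exactly one implementation.
interface NotificationTransport {
  send(recipient: string, body: string): string;
}

class EmailTransport implements NotificationTransport {
  send(recipient: string, body: string): string {
    return `email to ${recipient}: ${body}`;
  }
}

class NotificationService {
  constructor(private transport: NotificationTransport) {}
  notify(recipient: string, body: string): string {
    return this.transport.send(recipient, body);
  }
}

// What the feature needed. Abstract later, if a second transport ever shows up.
function sendEmail(recipient: string, body: string): string {
  return `email to ${recipient}: ${body}`;
}
```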
5. Missing edge cases

Agents are great at the happy path. They'll implement the core logic, handle the most obvious error case, and move on. What they consistently miss: null/undefined inputs on optional fields, network timeouts vs. connection refused vs. DNS failures, race conditions in concurrent code, partial failures in batch operations, and the difference between an empty result and an error.
What to check: For every function the agent wrote, mentally trace the unhappy paths. What happens when the input is null? What happens when the network call times out? What happens when the database returns an empty result set? If the agent didn't handle it, that's a bug waiting to happen.
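A sketch of the difference, using a hypothetical metrics helper: the happy-path version silently produces NaN on an empty batch, while the hardened version makes missing, empty, and invalid inputs explicit, distinct cases.

```typescript
// Happy-path version an agent might write: correct for non-empty input only.
function averageLatencyNaive(samples: number[]): number {
  return samples.reduce((a, b) => a + b, 0) / samples.length; // [] gives NaN
}

// Hardened version: "no data" (null) is kept distinct from a computed average,
// and garbage values are filtered rather than silently poisoning the result.
function averageLatency(samples: number[] | null | undefined): number | null {
  if (!samples || samples.length === 0) return null;
  const valid = samples.filter((s) => Number.isFinite(s) && s >= 0);
  if (valid.length === 0) return null;
  return valid.reduce((a, b) => a + b, 0) / valid.length;
}
```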
6. Wrong business logic

This is the hardest failure mode to catch because it requires domain knowledge the agent doesn't have. The agent implements a discount calculation that applies the discount after tax instead of before. It implements a permission check that grants access when it should deny. The code is clean, the tests pass (because the tests encode the same wrong assumption), and the behavior is subtly incorrect.
Agents can't read your product spec. They can't sit in on the planning meeting where your team decided that free-tier users get 5 API calls per minute, not 5 per second. They interpolate business rules from code patterns, and when the code pattern is ambiguous, they guess.
What to check: For any PR that touches business logic, verify the implementation against the spec or ticket, not just against whether the code "looks right." Pay special attention to ordering of operations, boundary conditions (is it >= or >?), and anything involving money, permissions, or rate limits.
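The discount-ordering bug is easy to demonstrate with numbers. Assume a hypothetical spec: a fixed $10 discount applies before 10% tax. (With percentage discounts the two orderings happen to produce the same total, which is exactly why this class of bug survives casual review.) Working in integer cents keeps the arithmetic exact:

```typescript
const TAX_RATE = 0.1; // hypothetical 10% sales tax

// What the spec meant: discount the price, then tax the discounted amount.
function totalDiscountBeforeTax(cents: number, discountCents: number): number {
  return Math.round((cents - discountCents) * (1 + TAX_RATE));
}

// What the agent guessed: tax the full price, then subtract the discount.
function totalDiscountAfterTax(cents: number, discountCents: number): number {
  return Math.round(cents * (1 + TAX_RATE)) - discountCents;
}

console.log(totalDiscountBeforeTax(10_000, 1_000)); // 9900  -> $99.00
console.log(totalDiscountAfterTax(10_000, 1_000));  // 10000 -> $100.00
```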
7. Stale patterns and dependencies

LLMs have a training data cutoff, and the code patterns they favor tend to reflect what was popular when that data was collected. You'll see agents reach for moment.js instead of Temporal or native Intl APIs, use deprecated React lifecycle methods, or suggest Node.js patterns from the callback era. The code works, but it's accumulating tech debt from day one.
What to check: If the agent introduces a new dependency, check when it was last updated. If it uses a language or framework feature, check whether it's the current recommended approach. This is especially important for security-sensitive code where deprecated APIs often have known vulnerabilities.
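For example, where an agent might add moment.js just for display formatting, the current built-in equivalent needs no dependency at all:

```typescript
// Instead of: moment(date).format('MMMM D, YYYY')  (a new legacy dependency),
// use the built-in Intl API, available in Node and all modern browsers.
const date = new Date(Date.UTC(2025, 0, 15));

const formatted = new Intl.DateTimeFormat('en-US', {
  year: 'numeric',
  month: 'long',
  day: 'numeric',
  timeZone: 'UTC',
}).format(date);

console.log(formatted); // "January 15, 2025"
```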
Not all agent output deserves the same scrutiny. A raw, first-pass generation from a coding agent that was given a one-line prompt needs more review than output from an agent that has repo context, has been self-reviewed, or has already passed automated validation.
I think about agent trust in three tiers:
Low trust: raw generation. The developer typed a prompt, the agent produced code, and nobody verified it beyond "it compiles." Review this like you'd review code from a new hire on their first week. Check everything: imports, logic, edge cases, tests, patterns.
Medium trust: agent with repo context and self-review. Tools like Cursor and Claude Code that index the full repo produce better output because they can match existing patterns. If the agent also ran the test suite and fixed failures, you can ease up on pattern-matching checks and focus more on business logic and edge cases.
Higher trust: agent output validated by AI review. When an AI code review tool has already scanned the PR for hallucinated APIs, test quality, and pattern consistency, your review can focus on the things only you can evaluate: business logic correctness, architectural fit, and whether this change actually solves the right problem.
Here's the condensed version you can keep open during reviews. For each agent-generated PR, run through these questions:

1. Hallucinated APIs: does every import, method call, and endpoint actually exist in the dependency versions your lockfile pins?
2. Test quality: would each test fail if the implementation had a bug, or does it just mirror the code or the mock?
3. Pattern consistency: does the new code follow the codebase's existing error-handling, data-fetching, and state-management conventions?
4. Over-abstraction: is the number of files and interfaces proportional to the feature, or is the code solving a problem you don't have?
5. Edge cases: are null inputs, timeouts, empty results, and partial failures handled, not just the happy path?
6. Business logic: does the implementation match the spec or ticket, including ordering of operations and boundary conditions?
7. Freshness: are the dependencies and APIs current, or deprecated patterns from the training-data era?
If you're going to review agent code differently, you should measure whether your approach is effective. Track how often agent-authored bugs escape review and reach production, and whether that rate drops as the checklist becomes habit.
There's a useful irony here: AI is also the best tool for catching the specific failure modes that AI coding agents produce. A human reviewer can miss a hallucinated import because the name looks plausible. An AI code review tool that indexes your full repo can flag it instantly because it knows the import doesn't exist.
This is the approach behind tools like Tenki Code Reviewer, which indexes entire repositories to provide context-aware analysis. When it reviews a PR, it's not just looking at the diff in isolation; it understands how the changed code relates to the rest of the codebase. That makes it effective at catching exactly the patterns in this checklist: hallucinated APIs that don't match your actual interfaces, tests that don't exercise real behavior, and patterns that diverge from your established conventions.
The best workflow I've found: let the AI review tool handle items 1, 2, 3, and 7 from the checklist (hallucinated APIs, test quality, pattern consistency, and freshness). These are mechanical checks that benefit from full-repo context. Then spend your human review time on items 4, 5, and 6 (over-abstraction, edge cases, and business logic). That's where your judgment actually matters.
Reviewing agent-generated code isn't harder than reviewing human code. It's just different. The failure modes are different, the signals are different, and the review strategy needs to adapt.
The biggest risk isn't that agents write bad code. It's that they write code that looks so good you stop scrutinizing it. Surface quality creates false confidence. The checklist above exists to counteract that by giving you specific, concrete things to verify instead of relying on vibes.
As agents get better and repo-context tools mature, some items on this checklist will matter less. Hallucinated APIs are already less common with context-aware agents than with vanilla completions. But business logic, edge cases, and architectural judgment aren't going anywhere. Those are still yours.
