
Building a code review tool: The LLM patterns that actually work

13 May 2026
  • News

The Data and Analytics team at G-Research provides core infrastructure (databases, streaming, batch processing and workflow systems) used across research, cybersecurity and risk management.

Like most of our platform teams, we primarily use Python. While we’ve had access to AI coding assistants for some time, we hadn’t yet built an LLM-powered system ourselves.

This project changed that.

We developed a team coding standards review tool that checks git diffs against our internal Python coding standards using an LLM. It runs in CI/CD on every feature branch and posts its findings as a PR comment, so engineers can see standards violations before the review process begins.

It began as a 10%-time project, aligned with a broader Data and Analytics Engineering north star focused on delivery, reliability, and reuse.

Moving from “a tool that kind of works” to “a tool you’d trust in CI” involved solving issues that don’t exist in traditional software. This post shares what we learned along the way.

Why an LLM and not a linter?

Linters work well for syntactic rules (“use f-strings instead of % formatting”).

But many of our standards depend on intent, not just structure. For example, a function returning a large dictionary might represent structured data (better as a dataclass) or computed results (acceptable as a dict).

A linter can’t reliably distinguish those cases, but a human reviewer can. This is exactly the gap we used an LLM to fill.

Beyond the technical justification, this was a deliberate learning exercise. We aimed for hands-on experience developing production software on top of LLM APIs, understanding the failure modes, the testing challenges, and the engineering patterns first-hand.

Trustworthy output

Our primary concern was trustworthiness. The tool comments on every PR, so hallucinated or irrelevant findings would quickly undermine trust.

We regarded the LLM as an untrusted component from the very start. We required two guarantees:

  • Every finding must correspond to a genuine rule
  • The review scope must be strictly confined to those rules

We began by employing structured output mode with a JSON schema based on our Pydantic models. This restricts the LLM to a fixed format and stops it from generating arbitrary commentary.
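As a rough sketch of that contract (the field names are illustrative, not the tool’s actual schema):

```python
from pydantic import BaseModel

class Finding(BaseModel):
    rule_id: str   # e.g. "PS-020"
    file: str
    line: int
    message: str

class ReviewResponse(BaseModel):
    findings: list[Finding]

# The JSON schema derived from the models is handed to the provider's
# structured output option, so replies can only take this shape.
response_schema = ReviewResponse.model_json_schema()
```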

Structured output ensures the response format, not its accuracy. An LLM can still generate valid JSON referencing a rule that does not exist.

To resolve this, we convert our standards document into a rules index: a dictionary that maps rule IDs to authoritative metadata. Each rule adheres to a consistent Markdown format.

### PS-020 – Use f-strings for String Formatting

**Category:** Language Usage

**Level:** MUST

**Applies when:** FILE ends with `.py`

Use f-strings for string formatting instead of old-style `%` formatting or `.format()`.

**Automated enforcement:**

Only report when old-style `%` formatting or `.format()` calls are introduced.
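A minimal sketch of how such an index might be built, assuming every rule follows the heading and **Level:** pattern shown above (the regexes and `Rule` fields are illustrative):

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    rule_id: str
    title: str
    level: str  # MUST, SHOULD, MUST NOT

# Hypothetical parser: relies on every rule following the consistent
# "### <ID> – <Title>" / "**Level:** <LEVEL>" Markdown format.
HEADING = re.compile(r"^### (PS-\d+) – (.+)$", re.MULTILINE)
LEVEL = re.compile(r"^\*\*Level:\*\* (MUST NOT|MUST|SHOULD)", re.MULTILINE)

def build_rules_index(standards_markdown: str) -> dict[str, Rule]:
    index: dict[str, Rule] = {}
    # re.split on a pattern with groups yields [prefix, id, title, body, id, ...]
    sections = HEADING.split(standards_markdown)[1:]
    for rule_id, title, body in zip(sections[0::3], sections[1::3], sections[2::3]):
        level = LEVEL.search(body)
        index[rule_id] = Rule(rule_id, title.strip(), level.group(1) if level else "SHOULD")
    return index
```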

Every LLM finding is validated against this index using Pydantic’s validation context. If the model invents a rule, misspells a title, or assigns the wrong level, validation rejects it outright.

The standards document is the sole source of truth; the model can suggest, but never define.

We also avoid letting the LLM specify fields we can derive deterministically. For example, our RFC 2119 levels (MUST, SHOULD, MUST NOT) map directly to severity, so we compute that ourselves. This reduces inconsistency and simplifies the prompt.
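Putting these pieces together, a minimal sketch of the cross-validation (the rule data, field names, and severity mapping are illustrative, not the tool’s exact ones):

```python
from pydantic import BaseModel, ValidationInfo, field_validator

RULES_INDEX = {"PS-020": {"title": "Use f-strings for String Formatting", "level": "MUST"}}

# Severity is derived deterministically from the RFC 2119 level,
# never taken from the LLM (mapping here is an assumption).
SEVERITY = {"MUST": "error", "MUST NOT": "error", "SHOULD": "warning"}

class Finding(BaseModel):
    rule_id: str
    file: str
    line: int
    message: str

    @field_validator("rule_id")
    @classmethod
    def rule_must_exist(cls, rule_id: str, info: ValidationInfo) -> str:
        # The rules index is passed in as validation context; an invented
        # rule ID fails validation and the finding is rejected outright.
        rules = (info.context or {}).get("rules", {})
        if rule_id not in rules:
            raise ValueError(f"unknown rule ID: {rule_id}")
        return rule_id

raw = {"rule_id": "PS-020", "file": "app.py", "line": 12, "message": "Use an f-string"}
finding = Finding.model_validate(raw, context={"rules": RULES_INDEX})
severity = SEVERITY[RULES_INDEX[finding.rule_id]["level"]]  # "error"
```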


Graceful recovery

LLMs don’t fail like traditional APIs; they truncate, drift, and generate structurally valid but incorrect output. In a CI pipeline, these issues can’t block developers.

One issue is truncation. When responses exceed token limits, JSON gets cut off midstream. We detect this using the “length” finish reason and retry with a lower findings cap. A guard prevents infinite retries.
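A sketch of the retry loop, assuming a hypothetical `client.complete` call that returns the response text and the provider’s finish reason:

```python
MAX_ATTEMPTS = 3  # guard: never retry indefinitely

def review_with_retry(client, prompt_template: str, max_findings: int = 20) -> str:
    # prompt_template is assumed to contain a {max_findings} placeholder.
    for _ in range(MAX_ATTEMPTS):
        text, finish_reason = client.complete(prompt_template.format(max_findings=max_findings))
        if finish_reason != "length":  # "length" means the response was cut off
            return text
        # Truncated mid-JSON: ask for fewer findings so the reply fits the limit.
        max_findings = max(1, max_findings // 2)
    raise RuntimeError("review still truncated at the minimum findings cap")
```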

Another issue was valid JSON that failed the Pydantic cross-validation described above. Instead of discarding the response, we send it back to the LLM along with the validation errors and ask it to fix only the structural issues. We limit this to a single repair attempt.
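Sketched roughly, reusing the `ReviewResponse` model and rules index from the earlier sketches (the prompt wording is illustrative):

```python
from pydantic import ValidationError

def parse_with_repair(client, raw_json: str, rules_index: dict) -> ReviewResponse:
    try:
        return ReviewResponse.model_validate_json(raw_json, context={"rules": rules_index})
    except ValidationError as exc:
        # One repair attempt only: hand back the validation errors and ask the
        # model to fix the structure, not the content.
        repaired, _ = client.complete(
            "Fix only the structural problems in this JSON so it passes validation.\n"
            f"Validation errors:\n{exc}\n\nJSON:\n{raw_json}"
        )
        return ReviewResponse.model_validate_json(repaired, context={"rules": rules_index})
```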

The most frustrating issue was provider inconsistency. Some providers return bare JSON, some wrap JSON in markdown code fences, and others use different formats. We now normalise responses by removing wrappers before parsing. It’s a small detail, but critical for supporting multiple providers.
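The normalisation itself is small; a sketch:

```python
import re

# Some providers wrap the JSON in Markdown code fences; strip them before parsing.
FENCE = re.compile(r"^```(?:json)?\s*(.*?)\s*```$", re.DOTALL)

def strip_code_fences(text: str) -> str:
    text = text.strip()
    match = FENCE.match(text)
    return match.group(1) if match else text

assert strip_code_fences('```json\n{"findings": []}\n```') == '{"findings": []}'
assert strip_code_fences('{"findings": []}') == '{"findings": []}'
```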

Two-pass verification

Our initial single-pass approach suffered from a precision issue. A review might yield eight findings, but two or three would be false positives.

In a CI environment, false positives are expensive; engineers risk learning to disregard the tool.

We introduced a second LLM call. After the initial review, we send the findings back to the model, along with examples of common false positives, and ask it to identify which are genuine violations.

The key insight was splitting recall and precision into separate prompts. The first pass captures everything; the second filters. This approach was simpler than using a single complex prompt and reflects how human reviewers operate.
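In outline (the helper names and prompt contents are illustrative, not the tool’s actual code):

```python
def review_diff(client, diff: str, rules_index: dict) -> list[Finding]:
    # Pass 1 - recall: flag every potential violation of the indexed rules.
    candidates = run_review_pass(client, diff, rules_index)

    # Pass 2 - precision: show the candidates back to the model alongside
    # examples of common false positives, and keep only the confirmed ones.
    confirmed = run_verification_pass(client, diff, candidates)  # indices of genuine findings
    return [candidates[i] for i in confirmed]
```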

Provider abstraction

We evaluated models from different providers and needed to switch between them easily.

We defined a protocol-based client abstraction using Python’s Protocol class. The review logic depends only on this interface, while implementations handle provider-specific quirks like parameter differences and response formatting.
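A sketch of that interface (simplified; the real protocol also surfaces token usage, and the client names here are placeholders):

```python
from typing import Protocol

class LLMClient(Protocol):
    def complete(self, prompt: str, *, json_schema: dict | None = None) -> tuple[str, str]:
        """Return (response_text, finish_reason)."""
        ...

# Concrete clients implement the same interface and absorb provider quirks,
# such as differing parameter names and response wrappers.
class ProviderAClient:
    def complete(self, prompt, *, json_schema=None):
        ...

class ProviderBClient:
    def complete(self, prompt, *, json_schema=None):
        ...
```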

A model registry serves as the single source of truth for supported models, with each entry defining its configuration, pricing, and aliases. Everything else is automatically derived.

Convenience aliases such as `--model fast` or `--model best` allow users to switch models without having to remember provider-specific names.
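A sketch of the registry and alias lookup (the model names, prices, and aliases are placeholders, not real entries):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelSpec:
    provider: str
    model_name: str
    input_cost_per_1k: float   # illustrative pricing fields
    output_cost_per_1k: float

MODEL_REGISTRY: dict[str, ModelSpec] = {
    "provider-a/large-model": ModelSpec("provider-a", "large-model", 0.005, 0.015),
    "provider-b/small-model": ModelSpec("provider-b", "small-model", 0.0005, 0.0015),
}

ALIASES = {"best": "provider-a/large-model", "fast": "provider-b/small-model"}

def resolve_model(name: str) -> ModelSpec:
    # Accept either an alias ("fast") or a full registry key.
    return MODEL_REGISTRY[ALIASES.get(name, name)]
```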

Testing non-deterministic output

Testing LLM systems means defining what “correct” looks like.

You can’t expect exact equality; outputs differ in wording, order, and line numbers. Instead, we test at the level that matters.

Each regression test comprises:

  • One rule
  • One synthetic diff
  • Expected findings

The harness focuses on structural correctness (rule ID, file, severity) while disregarding non-deterministic fields like message wording and ordering.
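A sketch of the comparison the harness performs (the fixture and helper names are hypothetical):

```python
def structural(findings) -> set[tuple[str, str, str]]:
    # Reduce each finding to the fields that should be deterministic and
    # compare as a set, so wording and ordering differences don't matter.
    # Findings here are the enriched results, with severity already derived.
    return {(f.rule_id, f.file, f.severity) for f in findings}

def test_ps020_flags_percent_formatting():
    diff = SYNTHETIC_DIFFS["ps020_percent_formatting"]
    findings = review_diff(client, diff, rules_index)
    assert structural(findings) == {("PS-020", "app.py", "error")}
```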

For integration tests involving multiple files and rules, we use severity-weighted thresholds:

  • MUST / MUST NOT: 100% recall required
  • SHOULD: 75% recall
  • No false positives
  • Overall precision > 85%

The asymmetry is deliberate: failing to follow a mandatory rule is worse than overlooking a recommendation.
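A simplified sketch of the scoring, with findings reduced to hashable `(rule_id, file)` tuples (the per-rule tests additionally reject any false positives):

```python
def recall(expected: set, found: set) -> float:
    return len(expected & found) / len(expected) if expected else 1.0

def passes_thresholds(expected_must: set, expected_should: set, found: set) -> bool:
    expected = expected_must | expected_should
    precision = len(found & expected) / len(found) if found else 1.0
    return (
        recall(expected_must, found) == 1.0         # MUST / MUST NOT: 100% recall
        and recall(expected_should, found) >= 0.75  # SHOULD: 75% recall
        and precision > 0.85                        # overall precision
    )
```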

These tests also serve as a model evaluation framework. New models need to meet these thresholds to be adopted, making model choice a measurable decision.

Cost visibility

Every API call reports token usage, which we total throughout the review. The CLI displays prompt tokens, completion tokens, and an estimated cost based on model pricing.
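A sketch of the accounting, reusing the illustrative `ModelSpec` pricing fields from the registry sketch above:

```python
from dataclasses import dataclass

@dataclass
class UsageTracker:
    prompt_tokens: int = 0
    completion_tokens: int = 0

    def add(self, prompt_tokens: int, completion_tokens: int) -> None:
        # Called once per API call with the usage reported by the provider.
        self.prompt_tokens += prompt_tokens
        self.completion_tokens += completion_tokens

    def estimated_cost(self, spec: "ModelSpec") -> float:
        return (
            self.prompt_tokens / 1000 * spec.input_cost_per_1k
            + self.completion_tokens / 1000 * spec.output_cost_per_1k
        )
```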

This allowed us to answer a key question early: is this cheap enough to run in CI? In practice, per-PR costs are low enough to make continuous use feasible, and having visibility made model selection a data-driven decision.

What happened next

The tool is now integrated into our standard Python CI/CD pipeline. It executes on each feature branch and posts findings as PR comments.

We deliberately decided not to block merges. There is still a human involved and space for discretion. In practice, issues are typically dealt with when flagged.

The tool changed how engineers interact with the standards document. Rules that were previously easy to miss are now surfaced automatically in every PR. If a new rule generates too many false positives, this feedback helps improve the rule or prompt.

The test suite has also proven valuable for model selection. We evaluate new models against it, rejecting candidates that regress on key rules.

The tool has also been adopted outside of the Data and Analytics team; the Linux Engineering group are now using it.

Going forward

We’re working on two things:

  1. Making the tool fully language-agnostic, so it can operate on any standards file.
  2. Packaging reusable CI/CD modules to make adoption easier across teams.

Key lessons

Treat LLM output as unverified input.

    • Validate against a source of truth and derive what you can deterministically.

Design for failure.

    • Truncation, validation failures, and provider quirks are normal operating conditions.

Separate recall and precision.

    • Two simple passes outperform one complex prompt.

Abstract the provider.

    • A clean interface makes model switching and evaluation trivial.

Test behaviour, not wording.

    • Assert structure and thresholds, not exact outputs.

Track cost early.

    • Cost visibility enabled informed decisions about models and CI usage.

As the first LLM application for a team of platform engineers, the project delivered more than just the tool itself. We now have practical experience with the failure modes, testing strategies, and cost dynamics of LLM-based systems, and that knowledge directly influences how we evaluate and build on this technology moving forward.

By Austin, Data and Analytics Principal Engineer
