Caramelo vs Claude Code: benchmark results

Caramelo exists to answer hard questions about code you did not write: how it works, why it behaves that way, and where the relevant logic lives, even when that logic is spread across repositories.

Not to generate code. Not to refactor it. To help someone understand an unfamiliar system quickly, including the messy parts where the answer crosses repository boundaries.

To know whether Caramelo is getting better at that, we needed a benchmark. Claude Code is the comparison point because it is already excellent at answering deep questions about a repository. For Caramelo to be useful, it has to meet that standard while also doing the extra work of finding the right context on its own.

This post walks through the benchmark we built, what we measured, what we found, and where the comparison is still incomplete.

The headline result

In this benchmark run, Caramelo answered cross-repository codebase questions with quality comparable to Claude Code, while returning answers 3x faster and at 30% lower cost per question.

The difference was not just speed: Caramelo also had to find the right repository context on its own.

Result	What it means
Comparable answer quality	Caramelo stayed in the same quality range as Claude Code on codebase questions.
3x faster	Caramelo returned answers faster on median response time in this benchmark harness.
30% cheaper	Caramelo cost less per question based on reported and computed model usage.
Stronger breadth	Caramelo surfaced more related system context, especially when the answer touched multiple parts of the codebase.

Takeaway: Caramelo maintained Claude Code-level answer quality while searching across repositories, answering 3x faster and costing 30% less per question.

What we tested

We tested Caramelo and Claude Code on questions about two repositories from Twitter's open-source algorithm release: the-algorithm and the-algorithm-ml.

These were useful benchmark targets because they are public, non-trivial, and well studied. They are large enough that finding the right context matters, and open enough that anyone can repeat the benchmark and check the answers against detailed public analyses of how the algorithm works.

For each question, we collected one answer from Claude Code and one from Caramelo.

Claude Code was given the relevant repository up front. Caramelo had both repositories indexed and had to find the useful context itself.

That distinction matters because most codebase questions do not start with perfect context. They start with uncertainty: "Where does this live?", "Which service owns this?", "What else does this depend on?", or "Why does this behavior exist?"
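As a rough sketch of that setup (not the actual harness), each question goes to both solvers, and only Claude Code is told which repository to look in. The names `Question`, `Solver`, and `run_benchmark` are hypothetical, and the solvers are plain callables standing in for the real tools.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Question:
    text: str
    repo: str  # repository where the answer is expected to live


# A solver takes the question text and an optional repository hint.
Solver = Callable[[str, Optional[str]], str]


def run_benchmark(
    questions: List[Question], claude_code: Solver, caramelo: Solver
) -> List[Dict[str, str]]:
    """Collect one answer per solver for every question."""
    rows = []
    for q in questions:
        rows.append({
            "question": q.text,
            # Claude Code is pointed at the relevant repository up front.
            "claude_code": claude_code(q.text, q.repo),
            # Caramelo gets no hint and must search everything it has indexed.
            "caramelo": caramelo(q.text, None),
        })
    return rows
```

The only asymmetry between the two calls is the repository hint, which is the distinction the benchmark is designed around.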

Figure: Benchmark workflow, from questions through solvers and evaluators to the report.

Takeaway: The benchmark tested the hard part: answering codebase questions when the user does not already know where the answer lives.

Speed

Caramelo returned answers 3x faster on median response time in this benchmark harness.

Speed matters because code investigation usually happens inside another task. A long wait pulls the developer out of that task. A fast answer keeps the investigation loop going.

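We treat this as a benchmark result, not a universal latency claim. The harness measured wall-clock time for each solver, and the two systems do different setup work: Caramelo answers from repositories that were already indexed, while Claude Code starts from the repository it was given.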
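A minimal sketch of this kind of measurement, assuming each solver can be called as a plain function. The helper names `timed` and `median_speedup` are hypothetical, and the numbers in the usage line are made up for illustration, not benchmark data.

```python
import statistics
import time


def timed(solver, question):
    """Run one solver call and return (answer, wall-clock seconds)."""
    start = time.perf_counter()
    answer = solver(question)
    return answer, time.perf_counter() - start


def median_speedup(baseline_seconds, candidate_seconds):
    """Ratio of median response times; a value above 1 means the candidate is faster."""
    return statistics.median(baseline_seconds) / statistics.median(candidate_seconds)


# Illustrative timings only.
print(median_speedup([90.0, 120.0, 150.0], [30.0, 40.0, 50.0]))  # 3.0
```

Using the median rather than the mean keeps one unusually slow question from dominating the comparison.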

Figure: Benchmark speed and cost. 3x faster median response time and 30% lower cost per question.

Takeaway: Caramelo was 3x faster while preserving answer quality, keeping code investigation inside the developer's flow.

Cost

Caramelo also cost 30% less per question in this benchmark run.

The same logic applies to cost: codebase understanding is not a one-question workflow. Developers rarely ask one perfect question and stop. They ask follow-ups, compare files, narrow scope, and check assumptions.

For Claude Code, we used the model cost reported by the CLI. For Caramelo, we computed cost from token usage across the LLM calls made during each answer.
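The Caramelo side of that calculation can be sketched as follows. This is an assumption about the general shape, not the actual accounting code: `LLMCall` and `answer_cost` are hypothetical names, and the per-token prices are placeholders rather than real pricing.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class LLMCall:
    input_tokens: int
    output_tokens: int


# Placeholder prices in USD per million tokens (assumed, not real pricing).
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00


def answer_cost(calls: Iterable[LLMCall]) -> float:
    """Sum the cost of every LLM call made while producing one answer."""
    total = sum(
        c.input_tokens * INPUT_PRICE_PER_MTOK + c.output_tokens * OUTPUT_PRICE_PER_MTOK
        for c in calls
    )
    return total / 1_000_000
```

Summing across every call matters because a single answer may involve several retrieval and synthesis steps, each with its own token usage.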

If every question is slow or expensive, people use the tool less. If the cost curve is reasonable, the tool can become part of the daily investigation workflow.

Takeaway: Lower cost makes it practical to ask the follow-up questions that real codebase understanding requires.

How we measured answer quality

Comparing two AI-generated answers is easy to do badly. A single score hides too much, and a long answer can look better than it is.

We measured answer quality across several dimensions:

Dimension	What it measured
Answer completeness	Did the answer address every distinct part of the question?
Code grounding	Did the answer cite concrete files, functions, classes, or other code artifacts?
Reference coverage	Did Caramelo find the important facts, reasoning steps, and conclusions Claude Code found?
Key insight capture	Did the answer carry forward the insight a developer needs to understand the system?
Answer structure	Was the answer easy to follow, scan, and use?
Breadth and depth	Did the answer cover relevant areas of the system and explain them with enough implementation detail?

Some evaluators scored each answer on its own. Others compared Caramelo's answer against Claude Code's answer, used as the reference. Others compared the two answers directly.

That gave us a more useful picture than "winner" or "loser." It showed where each answer was complete, where it was grounded in the code, where it was easy to follow, and where it gave the reader broader or deeper understanding.

Figure: Answer quality benchmark metrics. Completeness tied, code grounding higher, structure slightly ahead, reference coverage strong.

Takeaway: The benchmark measured whether answers build working understanding, not whether they merely sound complete.

Answer completeness

The first thing we measured was whether each answer actually addressed the question.

This is the baseline. A fast answer is not useful if it skips the thing the developer needed to understand. A broad answer is not useful if it never lands the point.

This metric is about scope, not factual correctness. It asks whether the answer attempts to address every distinct part of the question.

On answer completeness, Caramelo and Claude Code were effectively tied.

That is a strong result because Claude Code was pointed at the relevant repository up front. Caramelo had to search across a larger indexed context and still produce an answer that covered the question.

Code grounding

The next metric was code grounding: does the answer give the reader concrete code artifacts they can verify, or does it stay at the level of plausible-sounding explanation?

This is where codebase AI tools earn trust or lose it. A polished explanation is dangerous if it invents architecture, names the wrong file, or explains behavior that does not exist.

Caramelo scored higher than Claude Code on code grounding in this benchmark run.

That matters because Caramelo was not just answering from a known repository. It had to find the relevant context first, then tie its claims back to specific files, functions, classes, config keys, or other code artifacts.

The evaluator does not independently prove that every citation is correct. It measures whether the answer gives the reader something concrete to inspect instead of asking them to trust a generic explanation.
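To make "something concrete to inspect" tangible, here is a deliberately simple heuristic that pulls file paths and backticked identifiers out of an answer. It is a stand-in for illustration only; the benchmark's evaluator is not described as a regex, and the function name, the patterns, and the example path are all assumptions.

```python
import re

# File paths ending in a common source or config extension.
FILE_PATH = re.compile(
    r"\b[\w./-]+\.(?:py|scala|java|rs|go|ts|js|thrift|yaml|yml|json)\b"
)
# Identifiers wrapped in backticks, such as a function or class name.
BACKTICKED = re.compile(r"`([^`\n]+)`")


def cited_artifacts(answer: str) -> set:
    """Return the concrete code artifacts an answer points the reader to."""
    paths = set(FILE_PATH.findall(answer))
    identifiers = set(BACKTICKED.findall(answer))
    return paths | identifiers


# Hypothetical answers, not taken from the benchmark.
grounded = "Scoring happens in ranking/scorer.py, inside `score_candidates`."
vague = "The system ranks items using a model."
print(sorted(cited_artifacts(grounded)))  # ['ranking/scorer.py', 'score_candidates']
print(cited_artifacts(vague))             # set()
```

Like the evaluator described above, this only checks that the answer names artifacts. It says nothing about whether those artifacts exist or are the right ones.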

Reference coverage

One evaluator asked a narrow question: when Claude Code found an important fact, reasoning step, or conclusion, did Caramelo find it too?

This was useful because Claude Code sets a high bar for repository-level code understanding. If Caramelo found more context but missed the behavior Claude Code identified, the answer would still fail.

Caramelo scored strongly on reference coverage.

That means Caramelo was not just finding extra context. It also recovered the central facts and system behavior that made the answer useful.
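One way to think about reference coverage is as a fraction: of the facts in the reference answer, how many also appear in the candidate answer? The sketch below assumes the reference has already been split into facts and takes the matching rule as a parameter. `reference_coverage` and `naive_match` are hypothetical names, and the substring match is only a placeholder for whatever judgment the real evaluator makes.

```python
from typing import Callable, Sequence


def reference_coverage(
    reference_facts: Sequence[str],
    candidate_answer: str,
    is_covered: Callable[[str, str], bool],
) -> float:
    """Fraction of reference facts that the candidate answer also contains."""
    if not reference_facts:
        raise ValueError("reference answer has no facts to check")
    hits = sum(1 for fact in reference_facts if is_covered(fact, candidate_answer))
    return hits / len(reference_facts)


def naive_match(fact: str, answer: str) -> bool:
    """Placeholder matcher: case-insensitive substring check."""
    return fact.lower() in answer.lower()


# Made-up facts and answer, for illustration only.
facts = ["rate limiter", "config flag", "retry budget"]
answer = "The rate limiter reads a config flag before each request."
print(reference_coverage(facts, answer, naive_match))  # 2 of 3 facts covered
```

The score is one-directional by design: extra context in the candidate answer neither helps nor hurts, so it isolates the question of whether the reference's findings were recovered.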

Key insight capture

We also looked at whether Caramelo captured the key insight from Claude Code's answer.

A key insight is the thing a developer needs to carry forward: the architectural decision, root cause, mechanism, dependency, or constraint that changes how they understand the system.

Caramelo performed strongly here too.

That matters because an answer can cover many details and still miss the point. The benchmark is checking whether the answer carries forward the insight that makes the system easier to understand and act on.

Answer structure

We also evaluated answer structure: whether the answer was easy to follow, scan, and use.

Developers usually ask these questions mid-task. They do not want a wall of text. They want a clear path through the code, enough context to understand the answer, and enough signal to decide what to do next.

Caramelo scored slightly ahead on answer structure.

In practice, Caramelo tends to produce answers that show more of the path it took. That is intentional. When the system explores multiple possible context paths, the answer should reveal the shape of what it found instead of only returning the narrowest possible response.

Breadth and depth

The clearest qualitative difference was answer shape.

Claude Code tends to be very specific. When it is pointed at the right repository, it can give a tight answer to a tight question.

Caramelo tends to answer with more breadth. It surfaces related code paths, adjacent context, and supporting details that may not be obvious from the first file you would inspect.

That can be an advantage or a distraction depending on the job.

The benchmark treats breadth and depth separately. Breadth asks which answer covers more relevant areas of the system. Depth asks which answer explains those areas with more implementation detail, mechanisms, edge cases, and rationale.
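Keeping the two dimensions separate can be sketched as a per-dimension tally of head-to-head verdicts. The function name and the verdict format are assumptions, and the sample verdicts use neutral labels "A" and "B" because they are invented for illustration, not results.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def tally_pairwise(verdicts: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count wins per solver, separately for each dimension.

    Each verdict is a (dimension, winner) pair from one head-to-head comparison.
    """
    totals: Dict[str, Counter] = {}
    for dimension, winner in verdicts:
        totals.setdefault(dimension, Counter())[winner] += 1
    return totals


# Invented verdicts for two solvers labeled "A" and "B".
sample = [("breadth", "A"), ("breadth", "A"), ("depth", "B"), ("depth", "A")]
print(tally_pairwise(sample))
```

Tallying per dimension means a solver that wins on breadth and loses on depth shows up as exactly that, instead of being averaged into a single winner.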

Figure: Breadth and depth benchmark signals. Breadth covers more of the system; depth explains mechanisms and rationale.

If you already know exactly where the answer lives, a narrower answer may be enough. If you are exploring an unfamiliar system, onboarding into a codebase, debugging across services, or trying to understand the impact of a change, breadth is often what saves time.

Takeaway: Claude Code is strongest when the context boundary is already known. Caramelo is built for the moment before that.

Where the benchmark falls short

The hardest part of benchmarking Caramelo is also the most important part of the product.

Caramelo is meant to answer questions across repositories without making the user decide where the answer lives first. That is difficult to compare directly against a single-repository coding assistant.

The deeper problem is data. A proper cross-repository benchmark needs a real engineering codebase, multiple connected repositories, and questions that genuinely require context from more than one place. Public datasets like that are rare.

The Twitter repositories are a useful starting point, but they are not the final benchmark. We will keep improving the evaluation as we find better ways to measure multi-repo engineering questions.

What we are building toward

Caramelo is for engineering teams that spend too much time recovering context.

The current workflow is still too manual: search GitHub, grep locally, ask a teammate, open stale docs, paste files into an AI tool, realize the answer lives in another repo, repeat.

We want the question to be simpler:

Ask what you need to understand. Caramelo finds the relevant engineering context and gives you an answer grounded in the code.

This benchmark is how we keep ourselves honest while building toward that.

If your team spends too much time figuring out where the answer lives, join the Caramelo Alpha waitlist. If you want to talk through the benchmark, the methodology, or the problem of evaluating cross-repo code understanding, email us at [email protected].

Figure: Caramelo vs Claude Code. Same quality, 3x faster, 30% cheaper.