# Caramelo.dev

> Caramelo.dev helps engineering teams make unfamiliar codebases understandable.

## Product Definition

Caramelo.dev is a GitHub-connected codebase context platform for engineering teams. It indexes repository knowledge for semantic search and lets teams ask questions in AI chat about architecture, implementation details, debugging paths, and code ownership.

Caramelo.dev also exposes Model Context Protocol (MCP) access so external AI assistants can use repository context when answering engineering questions. The product is currently presented as an alpha build focused on code investigation and codebase understanding.
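For readers unfamiliar with MCP, tool invocations travel as JSON-RPC 2.0 messages using the `tools/call` method. The sketch below only builds such a message. The tool name `search_repositories` and its `query` argument are hypothetical placeholders, not Caramelo's documented interface.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical tool name and argument, used only for illustration.
request = build_tool_call(
    1,
    "search_repositories",
    {"query": "Where is candidate ranking implemented?"},
)
print(json.dumps(request, indent=2))
```

An MCP client library would normally construct and send this message for you. The point is that an external assistant asks for repository context through a named tool rather than by pasting files into a prompt.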

## Who Caramelo Is For

Caramelo.dev is for engineering teams that maintain codebases with important context spread across repositories, pull requests, issues, and teammates. The primary users are developers who need to understand unfamiliar code, debug regressions, plan implementation work, and onboard into existing systems.

Secondary users include engineering managers, technical leads, product managers, support engineers, and designers who need trustworthy engineering context without interrupting developers for every codebase question.

## Problems Caramelo Solves

Engineering teams lose time when codebase context is hard to find. Developers often rely on manual search, stale documentation, memory from teammates, and generic AI chat that does not have repository context.

Caramelo.dev helps teams recover engineering context, understand why code behaves the way it does, and turn scattered repository knowledge into answers that support debugging, onboarding, architecture review, and implementation planning.

## Core Workflows

- Code investigation: ask how a feature, bug, or architectural pattern works across a repository.
- Debugging support: trace likely root causes and impacted files with repository context.
- Onboarding: help engineers become productive in unfamiliar codebases faster.
- Implementation planning: estimate scope, identify dependencies, and surface risks before changing code.
- AI assistant context: use MCP to connect external AI assistants to Caramelo repository knowledge.

## Use Cases

- Multi-repository code search for teams that need answers across services and repositories.
- AI codebase understanding for unfamiliar systems, onboarding, debugging, and implementation planning.
- MCP context for coding agents that need current repository knowledge before answering engineering questions.
- Semantic code search that helps teams find architecture, dependencies, ownership, and implementation paths.

## Target Queries

Caramelo.dev is designed to be relevant for questions such as "what is a codebase context platform?", "how do coding agents search across repositories?", "best tools to understand unfamiliar codebases", "multi-repository semantic code search", and "MCP server for GitHub repositories".

## How Caramelo Is Different

Manual search finds files, but it does not explain system behavior. Stale documentation can describe an old version of the codebase. Generic AI chat can reason well, but it does not automatically know private repository context.

Caramelo.dev combines GitHub repository indexing, semantic search, AI chat, and MCP access so answers are grounded in current codebase context instead of memory, guesswork, or disconnected prompts.

## Public Pages

- [Home](https://caramelo.dev/) - Product overview for engineering teams that need codebase context.
- [Blog](https://caramelo.dev/blog) - Published articles about Caramelo.dev, engineering context, and code investigation.
- [Alpha](https://caramelo.dev/alpha) - Early access information for alpha availability.
- [Cookies](https://caramelo.dev/cookies) - Cookie information.
- [Sitemap](https://caramelo.dev/sitemap.xml) - Public crawl discovery map.

## Published Blog Posts

### Caramelo vs Claude Code: same quality, 3x faster, 30% cheaper

URL: https://caramelo.dev/blog/caramelo-vs-claude-code-benchmarks
Author: Frederico
Published: 2026-05-30
Summary: Caramelo matched Claude Code on answer quality while searching across repositories, responding 3x faster, and costing 30% less per question.

Caramelo exists to answer hard questions about code you did not write: how it works, why it behaves that way, and where the relevant logic lives, even when that logic is spread across repositories.

Not to generate code. Not to refactor it. To help someone understand an unfamiliar system quickly, including the messy parts where the answer crosses repository boundaries.

To know whether Caramelo is getting better at that, we needed a benchmark. Claude Code is the comparison point because it is already excellent at answering deep questions about a repository. For Caramelo to be useful, it has to meet that standard while also doing the extra work of finding the right context on its own.

This post walks through the benchmark we built, what we measured, what we found, and where the comparison is still incomplete.


## The headline result

In this benchmark run, Caramelo answered cross-repository codebase questions with quality comparable to Claude Code, while returning answers 3x faster and at 30% lower cost per question.

The difference was not just speed: Caramelo also had to find the right repository context on its own.

| Result | What it means |
| --- | --- |
| Comparable answer quality | Caramelo stayed in the same quality range as Claude Code on codebase questions. |
| 3x faster | Caramelo returned answers faster on median response time in this benchmark harness. |
| 30% cheaper | Caramelo cost less per question based on reported and computed model usage. |
| Stronger breadth | Caramelo surfaced more related system context, especially when the answer touched multiple parts of the codebase. |



> **Takeaway:** Caramelo maintained Claude Code-level answer quality while searching across repositories, answering 3x faster and costing 30% less per question.

## What we tested

We tested Caramelo and Claude Code on questions about two repositories from Twitter's open-source algorithm release: `the-algorithm` and `the-algorithm-ml`.

These were useful benchmark targets because they are public, non-trivial, and well-studied. They are large enough that finding the right context matters, but open enough that the benchmark can be repeated and compared against detailed public analyses of how the algorithm works.

For each question, we collected one answer from Claude Code and one from Caramelo.

Claude Code was given the relevant repository up front. Caramelo had both repositories indexed and had to find the useful context itself.

That distinction matters because most codebase questions do not start with perfect context. They start with uncertainty: "Where does this live?", "Which service owns this?", "What else does this depend on?", or "Why does this behavior exist?"

![Benchmark workflow: questions, solvers, evaluators, and report](/rails/active_storage/blobs/redirect/eyJfcmFpbHMiOnsiZGF0YSI6MTYsInB1ciI6ImJsb2JfaWQifX0=--3c00809054fabaf9e8ab0f6ad1fc7e180fc6f622/benchmark-methodology.png)


> **Takeaway:** The benchmark tested the hard part: answering codebase questions when the user does not already know where the answer lives.

## Speed

Caramelo returned answers 3x faster on median response time in this benchmark harness.

Speed matters because code investigation usually happens inside another task. Waiting changes the workflow. Fast answers keep the investigation loop alive.

We treat this as a benchmark result, not a universal latency claim. The harness measured wall-clock time for each solver, and the two systems do different setup work.
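As a rough illustration of how a median-based speedup can be derived from per-question wall-clock timings, here is a minimal sketch. The function name and the timing values are made up for illustration. They are not the benchmark's data and do not reproduce its result.

```python
from statistics import median


def median_speedup(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate is, comparing medians."""
    if not baseline_seconds or not candidate_seconds:
        raise ValueError("both timing lists must be non-empty")
    return median(baseline_seconds) / median(candidate_seconds)


# Illustrative wall-clock timings in seconds, one entry per question.
baseline = [8.0, 10.0, 14.0]
candidate = [4.0, 5.0, 9.0]

speedup = median_speedup(baseline, candidate)
print(f"{speedup:.1f}x faster on median response time")
```

Comparing medians rather than means keeps one unusually slow question from dominating the comparison.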

![Benchmark speed and cost: 3x faster median response time and 30% lower cost per question](/rails/active_storage/blobs/redirect/eyJfcmFpbHMiOnsiZGF0YSI6MTUsInB1ciI6ImJsb2JfaWQifX0=--7553b46c40914c2f8dfb741d65f9768b1a2e7952/benchmark-speed-cost.png)

> **Takeaway:** Caramelo was 3x faster while preserving answer quality, keeping code investigation inside the developer's flow.

## Cost

Caramelo also cost 30% less per question in this benchmark run.

The same logic applies to cost: codebase understanding is not a one-question workflow. Developers rarely ask one perfect question and stop. They ask follow-ups, compare files, narrow scope, and check assumptions.

For Claude Code, we used the model cost reported by the CLI. For Caramelo, we computed cost from token usage across the LLM calls made during each answer.
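Computing cost from token usage can be sketched as summing input and output tokens across every LLM call behind one answer and multiplying by per-token prices. Everything below is an assumption for illustration: the `LLMCall` type, the token counts, and the prices are hypothetical, not Caramelo's actual usage or any provider's real pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LLMCall:
    """Token usage for a single LLM call (hypothetical record shape)."""
    input_tokens: int
    output_tokens: int


def answer_cost_usd(calls, input_price_per_mtok, output_price_per_mtok):
    """Sum the cost of every LLM call made while producing one answer."""
    total = 0.0
    for call in calls:
        total += call.input_tokens / 1_000_000 * input_price_per_mtok
        total += call.output_tokens / 1_000_000 * output_price_per_mtok
    return total


# Illustrative usage: two calls and placeholder prices per million tokens.
calls = [LLMCall(200_000, 10_000), LLMCall(300_000, 20_000)]
cost = answer_cost_usd(calls, input_price_per_mtok=2.0, output_price_per_mtok=10.0)
print(f"${cost:.2f} per answer")
```

Averaging this per-answer figure over all benchmark questions gives a cost-per-question number that can be compared with a cost reported directly by another tool.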

If every question is slow or expensive, people use the tool less. If the cost curve is reasonable, the tool can become part of the daily investigation workflow.

> **Takeaway:** Lower cost makes it practical to ask the follow-up questions that real codebase understanding requires.

## How we measured answer quality

Comparing two AI-generated answers is easy to do badly. A single score hides too much, and a long answer can look better than it is.

We measured answer quality across several dimensions:

| Dimension | What it measured |
| --- | --- |
| Answer completeness | Did the answer address every distinct part of the question? |
| Code grounding | Did the answer cite concrete files, functions, classes, or other code artifacts? |
| Reference coverage | Did Caramelo find the important facts, reasoning steps, and conclusions Claude Code found? |
| Key insight capture | Did the answer carry forward the insight a developer needs to understand the system? |
| Answer structure | Was the answer easy to follow, scan, and use? |
| Breadth and depth | Did the answer cover relevant areas of the system and explain them with enough implementation detail? |

Some evaluators scored each answer on its own. Others compared each answer against a reference answer. Others compared the two answers directly.

That gave us a more useful picture than "winner" or "loser." It showed where each answer was complete, where it was grounded in the code, where it was easy to follow, and where it gave the reader broader or deeper understanding.
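One way to keep scores from collapsing into a single number is to aggregate them per solver and per dimension. The sketch below assumes each evaluator emits `(solver, dimension, score)` tuples. That record shape, the solver labels, and the scores are hypothetical and are not taken from the benchmark harness.

```python
from collections import defaultdict
from statistics import mean


def mean_by_dimension(scores):
    """Average scores per (solver, dimension) instead of one overall score."""
    buckets = defaultdict(list)
    for solver, dimension, score in scores:
        buckets[(solver, dimension)].append(score)
    return {key: mean(values) for key, values in buckets.items()}


# Illustrative scores on a made-up 1-5 scale.
scores = [
    ("solver_a", "completeness", 4),
    ("solver_a", "completeness", 5),
    ("solver_a", "grounding", 3),
    ("solver_b", "completeness", 4),
    ("solver_b", "grounding", 5),
]

for (solver, dimension), value in sorted(mean_by_dimension(scores).items()):
    print(f"{solver:10s} {dimension:14s} {value}")
```

Keeping the dimension in the key makes it visible when one answer is more complete while the other is better grounded.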

![Answer quality benchmark metrics: completeness tied, factual grounding higher, structure ahead, reference coverage strong](/rails/active_storage/blobs/redirect/eyJfcmFpbHMiOnsiZGF0YSI6MTcsInB1ciI6ImJsb2JfaWQifX0=--7a843878569677eeae9c6f86aecf880288889874/benchmark-quality-metrics.png)

> **Takeaway:** The benchmark measured whether answers build working understanding, not whether they merely sound complete.

## Answer completeness

The first thing we measured was whether each answer actually addressed the question.

This is the baseline. A fast answer is not useful if it skips the thing the developer needed to understand. A broad answer is not useful if it never lands the point.

This metric is about scope, not factual correctness. It asks whether the answer attempts to address every distinct part of the question.

On answer completeness, Caramelo and Claude Code were effectively tied.

That is a strong result because Claude Code was pointed at the relevant repository up front. Caramelo had to search across a larger indexed context and still produce an answer that covered the question.

## Code grounding

The next metric was code grounding: does the answer give the reader concrete code artifacts they can verify, or does it stay at the level of plausible-sounding explanation?

This is where codebase AI tools earn trust or lose it. A polished explanation is dangerous if it invents architecture, names the wrong file, or explains behavior that does not exist.

Caramelo scored higher than Claude Code on code grounding in this benchmark run.

That matters because Caramelo was not just answering from a known repository. It had to find the relevant context first, then tie its claims back to specific files, functions, classes, config keys, or other code artifacts.

The evaluator does not independently prove that every citation is correct. It measures whether the answer gives the reader something concrete to inspect instead of asking them to trust a generic explanation.

## Reference coverage

One evaluator asked a narrow question: when Claude Code found an important fact, reasoning step, or conclusion, did Caramelo find it too?

This was useful because Claude Code sets a high bar for repository-level code understanding. If Caramelo found more context but missed the behavior Claude Code identified, the answer would still fail.

Caramelo scored strongly on reference coverage.

That means Caramelo was not just finding extra context. It also recovered the central facts and system behavior that made the answer useful.
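Reference coverage can be thought of as a simple fraction: of the important items in the reference answer, how many also appear in the candidate answer. The sketch below assumes a judge has already decided which reference items were matched. The function name and the item labels are hypothetical, and the actual evaluator may score coverage differently.

```python
def reference_coverage(reference_items, matched_items):
    """Fraction of reference facts, steps, and conclusions found in the candidate."""
    reference = set(reference_items)
    if not reference:
        raise ValueError("reference answer has no items to cover")
    return len(reference & set(matched_items)) / len(reference)


# Illustrative labels for items extracted from a reference answer.
reference = ["fact_1", "fact_2", "step_1", "conclusion_1"]
# Items the candidate answer was judged to contain; extras do not raise the score.
matched = ["fact_1", "fact_2", "conclusion_1", "extra_context"]

print(reference_coverage(reference, matched))
```

Because extra context in the candidate is ignored, this metric rewards recovering what the reference found rather than answer length.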

## Key insight capture

We also looked at whether Caramelo captured the key insight from Claude Code's answer.

A key insight is the thing a developer needs to carry forward: the architectural decision, root cause, mechanism, dependency, or constraint that changes how they understand the system.

Caramelo performed strongly here too.

That matters because an answer can cover many details and still miss the point. The benchmark is checking whether the answer carries forward the insight that makes the system easier to understand and act on.

## Answer structure

We also evaluated answer structure: whether the answer was easy to follow, scan, and use.

Developers usually ask these questions mid-task. They do not want a wall of text. They want a clear path through the code, enough context to understand the answer, and enough signal to decide what to do next.

Caramelo scored slightly ahead on answer structure.

In practice, Caramelo tends to produce answers that show more of the path it took. That is intentional. When the system explores multiple possible context paths, the answer should reveal the shape of what it found instead of only returning the narrowest possible response.

## Breadth and depth

The clearest qualitative difference was answer shape.

Claude Code tends to be very specific. When it is pointed at the right repository, it can give a tight answer to a tight question.

Caramelo tends to answer with more breadth. It surfaces related code paths, adjacent context, and supporting details that may not be obvious from the first file you would inspect.

That can be an advantage or a distraction depending on the job.

The benchmark treats breadth and depth separately. Breadth asks which answer covers more relevant areas of the system. Depth asks which answer explains those areas with more implementation detail, mechanisms, edge cases, and rationale.

![Breadth and depth benchmark signals: breadth covers more of the system, depth explains mechanisms and rationale](/rails/active_storage/blobs/redirect/eyJfcmFpbHMiOnsiZGF0YSI6MTgsInB1ciI6ImJsb2JfaWQifX0=--7bf8d2bfe76a405eb4e6f1e92126566e839cc906/benchmark-breadth-depth.png)


If you already know exactly where the answer lives, a narrower answer may be enough. If you are exploring an unfamiliar system, onboarding into a codebase, debugging across services, or trying to understand the impact of a change, breadth is often what saves time.

> **Takeaway:** Claude Code is strongest when the context boundary is already known. Caramelo is built for the moment before that.


## Where the benchmark falls short

The hardest part of benchmarking Caramelo is also the most important part of the product.

Caramelo is meant to answer questions across repositories without making the user decide where the answer lives first. That is difficult to compare directly against a single-repository coding assistant.

The deeper problem is data. A proper cross-repository benchmark needs a real engineering codebase, multiple connected repositories, and questions that genuinely require context from more than one place. Public datasets like that are rare.

The Twitter repositories are a useful starting point, but they are not the final benchmark. We will keep improving the evaluation as we find better ways to measure multi-repo engineering questions.


## What we are building toward

Caramelo is for engineering teams that spend too much time recovering context.

The current workflow is still too manual: search GitHub, grep locally, ask a teammate, open stale docs, paste files into an AI tool, realize the answer lives in another repo, repeat.

We want the question to be simpler:

Ask what you need to understand. Caramelo finds the relevant engineering context and gives you an answer grounded in the code.

This benchmark is how we keep ourselves honest while building toward that.

If your team spends too much time figuring out where the answer lives, join the Caramelo Alpha waitlist. If you want to talk through the benchmark, the methodology, or the problem of evaluating cross-repo code understanding, email us at hello@caramelo.dev.

### Why We’re Building Caramelo

URL: https://caramelo.dev/blog/why-we-re-building-caramelo
Author: Marcos
Published: 2026-05-24
Summary: Why coding agents need multi-repo context, how Caramelo helps them find the right knowledge, and why a Brazilian dog became our name.

## Context for Coding Agents

There is a lot of talk right now about AI making developers faster.

We agree with that, but speed is not the whole story.

In our day-to-day work, the slowest part of engineering is often not typing the code. It is building enough understanding to change the right thing without breaking something else.

That work is familiar to anyone who has spent time inside a large system.

Where does this live? Why does this service behave this way? Who owns this logic? Is this still used? Is the real behavior in this repo, or in the worker, or in the service it calls three steps later?

Those questions can burn an afternoon. They turn into Slack threads, calls, old pull requests, half-remembered decisions, and long walks through files hoping the next one is the right one.

## From code generation to code understanding

When we started using Claude Code heavily, the part that changed our workflow was not only code generation. It was code understanding.

Being able to ask questions against a repository made exploration feel conversational. Instead of starting with a file tree and a guess, we could start with the question we actually had.

That sounds small, but when you do it every day, it changes the shape of the work.

We are Marcos and Frederico, longtime friends who have worked together across streaming, machine learning, and AI agents and evals, all in very large codebases spread across many repositories. We have each spent more than a decade inside systems where the hardest problems were rarely isolated to one repository.

The real challenge was context: knowing the history, the boundaries, the conventions, the owners, and the path from a product question to the code that actually answers it.

## The missing context

Claude Code made us excited about what coding agents could become. But it also made one limitation very clear: most real engineering systems do not fit neatly inside one repository.

They are spread across many repos, services, jobs, APIs, dashboards, queues, deployment paths, and operational assumptions.

A single agent looking at a single repo can help a lot, but it still misses the wider system that developers actually work inside.

That is why we started building Caramelo.

## What Caramelo does

Caramelo is built around a simple idea: coding agents need better context before they can do their best work.

Caramelo connects repositories, absorbs the knowledge inside them, and helps agents retrieve the pieces that matter for the question in front of the developer.

In plain terms, it is a context orchestration layer for engineering systems: the part that knows where to look, how the pieces relate, and what context should be brought back before an agent tries to answer.

The goal is to remove the time we lose getting oriented, and to make the great coding agents we already know even better.

Less digging through repos. Less guessing which service matters. Less time creating local hacks to give context from multiple repos to your agent. Less **Time and Tokens**.

## Why Caramelo

The name is part of the story.

In Brazil, caramelo is not only a color. It is also the _vira-lata caramelo_: the caramel-colored mutt almost everyone recognizes. The dog outside the bakery. The dog crossing the street like it owns the neighborhood. The dog that shows up in memes, in adoption stories, and even in campaigns to put it on Brazilian money.

Not a breed exactly. More like a vibe.

Street-smart. Friendly. Resilient. The kind of dog that somehow belongs to nobody and everybody at the same time.

![Caramelo doguinho](/rails/active_storage/blobs/redirect/eyJfcmFpbHMiOnsiZGF0YSI6NCwicHVyIjoiYmxvYl9pZCJ9fQ==--1a09f69b75e9e48bbacc3479bba46630a3330e8c/caramelo-doguinho%20Medium.png)


That felt right for what we are building.


## What comes next

We are early, but the retrieval foundation is ready for real engineering work.

Now we want to battle-test it where the problem is most obvious: teams with many repositories, many moving parts, and too much important context living between them.

This blog will be our field notes. We will share what we are building, what we are learning, where agents still get lost, and how Caramelo changes as we put it in front of real multi-repo systems.

There is a lot more coming.