Agents and agentic pipelines are being built and released at an unprecedented pace. But how can we determine how good they actually are?
Why evaluating AI agents matters
Think of an AI agent like a fitness tracker. The tracker always works: it never says “Unable to fetch accurate data” and will always produce a reading when you press the button, yet much of the time those readings are hardly accurate. The same goes for agents: every team has one for its specific use case, and it always generates a response. But very few teams know how well their agent performs in real-world scenarios. We assume these systems work, without really knowing how well.
Developing robust AI agent evaluation frameworks has become essential for several compelling reasons:
- Quality assurance: As AI agents become more autonomous and handle critical tasks, systematic evaluation ensures they meet reliability standards before deployment
- Performance benchmarking: Objective metrics allow for consistent performance tracking across model iterations and comparison against industry standards
- Targeted improvement: Detailed analysis pinpoints specific weaknesses, enabling efficient allocation of development resources
- Alignment verification: Ensures agents act according to design intentions and don’t develop unexpected behaviors in edge cases
- Compliance and risk management: Facilitates documentation of agent capabilities and limitations for regulatory and legal requirements
- Investment justification: Provides quantitative evidence of AI system value and improvement to stakeholders and decision-makers
Breaking down the agent evaluation process
Evaluating an agent, whether it’s a chatbot, a retrieval-augmented generation (RAG) system, or a tool-using LLM, requires a systematic approach to ensure your model is accurate, reliable, and robust. Let’s walk through the typical steps of evaluating an agentic workflow:
- Prepare the ground truth: Use public datasets if they fit your use case, or preferably, generate synthetic data tailored to your agent’s functionality
- Run the agent on the dataset: Feed each input/query from the dataset into the agent
- Log all agent activity: final responses, tool calls, outputs, and reasoning steps if applicable
- Create and perform an experiment: Evaluate the agent responses. Compare the agent’s responses with expected/reference answers. Handle partial matches, nested outputs, or structured data using custom comparison logic if needed. Aggregate results and compute evaluation metrics (accuracy, success rate, etc.).
- Persist results: Store the evaluation results so that they can be reproduced and referred to later. Identify failure points from the metrics

An ideal modular workflow using the framework
Prepare the ground truth
One of the major challenges in evaluating agentic systems is the availability of ground truth, i.e., the references against which you compare the agent’s responses. To evaluate a typical agent effectively, you need comprehensive ground truth data covering tool calls, tool parameters, tool outputs, and final answers. Collecting and labelling all of this ground truth data is time-consuming and requires a large amount of human contribution.
If we don’t have the resources to prepare a curated ground truth for the evaluation, we have two options: we can either leverage publicly available evaluation datasets or generate synthetic data to serve as the ground truth.
A synthetic data generator handles the creation of curated ground truth from raw documents. Here, raw documents refers to any document(s) containing information about the agent’s use case. For example, if the agent is a travel planner, a document listing locations along with information about each location can be provided as input to the data generator, which outputs a set of question-answer pairs that can be used to evaluate that agent. You can generate either single-hop queries (queries that can be answered from a single instance of data) or multi-hop queries (queries that can only be answered by combining multiple sources).
A sample output from the generator:
{
  "question": "How do the house rules for noise levels vary among the available rental options in Hell's Kitchen, Manhattan?",
  "answer": [
    "The house rules for noise levels among ......."
  ]
}
But this solution comes with its own disadvantages: the generated data is nondeterministic and will always contain some amount of noise. Also, a capable LLM is required to generate high-quality samples, and such LLMs tend to be resource-intensive and costly.
Run the agent on the ground truth
Once the ground truth is in place, the next step is to run the agent on this dataset. Run the agent on the question-answer pairs created by the generator, or on manually annotated reference ground truths, and retrieve the state output, where each agent state is in the following format (this state is logged by default if you are using popular frameworks like LangGraph):
{
  "question": "What is the price of copper?",
  "agent_responses": [
    "The current price of copper is $0.0098 per gram."
  ],
  "agent_tool_calls": [
    {
      "name": "get_price",
      "args": {
        "item": "copper"
      }
    }
  ],
  "agent_tool_outputs": [
    "$0.0098"
  ]
}
Pass this output from the agent to the data model (EvalDataset) to create a structured golden dataset (the golden dataset here refers to the dataset containing all the data required for the evaluation, i.e., the combination of the agent output and the ground truth the agent was run on). The framework includes a LoadOperator class to wrap and persist this data. It ensures that synthetic datasets are:
- Validated against the schema
- Automatically versioned
- Stored in Couchbase with a meaningful dataset_description
A key component in retrieval and storage of data is the LoadOperator. It handles ingestion, storage, and retrieval of evaluation datasets, abstracting away the details of Couchbase storage and exposing a clean interface to the rest of the framework. You can load and retrieve the evaluation dataset to and from the Couchbase KV store using the dataset_id.
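As a rough usage sketch (the import path, constructor arguments, and method names below are assumptions inferred from the description, not the framework’s exact API), loading and retrieving a dataset might look like this:

# Hypothetical usage sketch of EvalDataset and LoadOperator; names and
# signatures are illustrative assumptions based on the description above.
from eval_framework import EvalDataset, LoadOperator  # assumed import path

golden = EvalDataset(
    questions=["What is the price of copper?"],
    gt_answers=["The current price of copper is $0.0098 per gram."],
    agent_responses=["The current price of copper is $0.0098 per gram."],
    agent_tool_calls=[[{"name": "get_price", "args": {"item": "copper"}}]],
    agent_tool_outputs=[["$0.0098"]],
)

loader = LoadOperator(
    connection_string="couchbase://localhost",  # assumed Couchbase settings
    bucket="agent_evals",
)

# Persist with a meaningful description; the dataset is validated and versioned
dataset_id = loader.load(golden, dataset_description="Metal price agent eval, v1")

# Later (even in another session), retrieve the same dataset by its ID
golden_again = loader.get_dataset(dataset_id)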
Create and perform an experiment
Once we have the responses from the agentic system (the golden dataset), we initiate an experiment (a single, managed instance of evaluation performed on the golden dataset) using the framework with the evaluation data and a set of experiment options consisting of the metrics (more on metrics in the next section) to use and other information regarding your agentic system.
Experiment management in our framework goes beyond just logging results; it provides a comprehensive, automated system for tracking every aspect of an evaluation run. Each experiment is attached to a uniquely identified dataset, loaded and versioned with descriptive metadata and timestamps. This ensures that every dataset, whether synthetic or real, can be traced and reused with full transparency. Configurable parameters (e.g., model checkpoints, prompt versions, tool chains) are also stored alongside results, creating a metadata trail for every run.
The experiment manager also gives you the ability to initiate a chain of experiments, with each successive experiment tracking the changes in the agent relative to its parent experiment. Experiments are versioned using Git: commit your code and run the experiment to iteratively develop your agent and compare evaluations performed on the same agent across versions. The metadata of each experiment contains code difference logs, averaged metrics, and configurations that can be used to analyze whether a change improved the agentic system or not. Additionally, all datasets and experiments used in our framework are versioned, queryable, and scalable. We can store large evaluation sets, retrieve subsets for targeted analysis, and track metadata like timestamps and dataset descriptions, an enormous improvement over flat-file, spreadsheet, or in-memory document-based workflows.
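A hypothetical sketch of configuring and chaining experiments is shown below; the class names, option fields, and method signatures are illustrative assumptions inferred from the description rather than the framework’s actual API:

# Hypothetical sketch of configuring and chaining experiments; class names,
# option fields, and method signatures are illustrative assumptions.
from eval_framework import ExperimentManager, ExperimentOptions  # assumed

options = ExperimentOptions(
    metrics=["tool_call_accuracy", "logical_coherence", "answer_faithfulness"],
    llm_model="gpt-4o",  # judge model, recorded in experiment metadata
    agent_description="Metal price agent, prompt v3",
)

# dataset_id: the ID returned by the LoadOperator when the golden dataset was stored
manager = ExperimentManager(dataset_id="<dataset-id from the LoadOperator step>", options=options)

# Baseline run on the golden dataset
baseline = manager.run(experiment_id="experiment1")

# After committing a change to the agent, chain a follow-up experiment; the
# framework records the Git diff and lets you compare averaged metrics
follow_up = manager.run(experiment_id="experiment2", parent="experiment1")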
Persist results in Couchbase
The output from an experiment consists of JSON and CSV files with data instances and their corresponding scores. A sample result for a single agent conversation (an individual data instance) is shown below:
[
  {
    "user_input": "What is the price of copper?",
    "retrieved_contexts": null,
    "response": null,
    "reference": null,
    "agent_responses": [
      "",
      "The current price of copper is $0.0098 per gram."
    ],
    "agent_tool_calls": [
      {
        "name": "get_metal_price",
        "args": {
          "metal_name": "copper"
        }
      }
    ],
    "agent_tool_outputs": [
      "0.0098"
    ],
    "reference_tool_calls": [
      {
        "name": "get_metal_price",
        "args": {
          "metal_name": "copper"
        }
      }
    ],
    "gt_answers": [
      "",
      "The current price of copper is $0.0098 per gram."
    ],
    "gt_tool_outputs": [
      "0.0098"
    ],
    "answer_faithfulness": 3,
    "logical_coherence": 1.0,
    "agent_response_correctness": 0.5
  }
]
The results are stored in the Couchbase KV store with an experiment ID. Experiments can be retrieved using the LoadOperator, allowing you to refer back to and compare experiments performed during a different session.
In short, with the framework, running an experiment is not just a one-off script execution. It’s a managed process:
- You load or generate a ground truth dataset, which is versioned and described
- You configure your agent and evaluation parameters, all of which are logged
- You run the evaluation, and the framework automatically stores the results, along with all relevant metadata
- Later, you can retrieve any experiment, see exactly what changes were made to the system, and compare it to other runs
This level of experiment management is what turns evaluation from a “black box” into a transparent, repeatable, and collaborative process.
How does it work?

A mid-level architecture of the framework
At the core of the framework, there are four key components that work together to produce the final results.
Synthetic data generator
The data generator creates synthetic question-answer pairs from documents. These question-answer pairs can be used as the ground truth to evaluate your agentic system using our framework. The data generator takes in documents (CSV, JSON, or plaintext) and leverages an LLM steered by few-shot prompting to generate the pairs. The generation process is as follows:
- Documents ingested are cleaned and preprocessed
- A REBEL model is used to extract entity relationships from each document. REBEL is a seq2seq model based on BART that performs end-to-end relation extraction for more than 200 different relation types
- An entity-relation map is created for each document. Each of these entity-relation maps is embedded using a MiniLM-V2 embedding model (384-dimensional embeddings)
- The embeddings for the documents are clustered using an embedding clustering algorithm like HDBSCAN to obtain ‘n’ clusters of semantically similar documents (a minimal sketch of this stage appears below)
- These document clusters are provided to the LLM to generate multi-hop query answer pairs

Architecture of the Synthetic data generator
Additional information and custom instructions can also be provided if the user wants to generate the query and answers in a specific format or style.
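Under a few assumptions (the exact model checkpoints and clustering parameters are not specified by the framework’s description), the extraction-and-clustering stages might look roughly like this:

# Rough sketch of the relation-extraction + clustering stages; the checkpoints
# ("Babelscape/rebel-large", "all-MiniLM-L6-v2") and HDBSCAN parameters are
# illustrative assumptions, not the framework's actual configuration.
from transformers import pipeline
from sentence_transformers import SentenceTransformer
import hdbscan

documents = [
    "Clean and quiet apt home by the park. Private room, Brooklyn, strict house rules on noise.",
    "Skylit Midtown Castle. Entire home/apt, Manhattan, moderate cancellation policy.",
    "Cozy loft in Hell's Kitchen, Manhattan. Quiet hours after 10pm.",
    "Sunny studio near Central Park, Manhattan. No parties, flexible cancellation.",
]

# 1. Extract entity-relation text from each document with a REBEL-style model
extractor = pipeline("text2text-generation", model="Babelscape/rebel-large")
relation_maps = [extractor(doc)[0]["generated_text"] for doc in documents]

# 2. Embed each entity-relation map with a MiniLM encoder (384-dim vectors)
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(relation_maps)

# 3. Cluster semantically similar documents; each cluster then seeds the LLM
#    prompt that generates a multi-hop question-answer pair
clusterer = hdbscan.HDBSCAN(min_cluster_size=2)
cluster_labels = clusterer.fit_predict(embeddings)
print(cluster_labels)  # e.g., [0, 1, 0, 1] -- documents grouped by topic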
Evaluation dataset
This is the core data structure for managing ground truth data. The EvalDataset class takes in the ground truth data and the agent outputs (tool calls, agent responses, etc.) and structures the raw data into an easy-to-process format for the validation step that follows. It transforms the golden dataset (the dataset containing the ground truth and the corresponding agent responses) into a list of attributes that are important for the evaluation and can be easily processed by the validation engine. The evaluation dataset created has a dataset ID which can be used to pull the dataset from the Couchbase key-value store, removing the need to store and manage the dataset locally.
Validation engine
The validation engine processes the evaluation dataset and performs the evaluation on it. It is connected to a metric catalog that provides users with a whole set of metrics for evaluating every part of the agentic/RAG system. The metric catalog is also integrated with RAGAS, giving users the flexibility to use RAGAS metrics if they need them. The validation engine calculates the metrics for the evaluation dataset and combines the results into an interpretable dataframe, along with an averaged index that gives users an idea of how well the overall system performs.
Experiment manager
This is the central module that connects all the other components. It creates and manages evaluation experiments. An experiment is an individual instance of evaluation, consisting of a detailed output and metadata, along with code-tracking capabilities that give users insight into the changes they have made to their agentic system between two experiments.
The experiment manager takes in the evaluation data and a set of experiment options, consisting of the metrics to use and other information about your agentic system, and connects to the validation engine to evaluate the dataset and obtain the calculated scores. The scores are then processed and formatted to produce an evaluation report along with experiment metadata, which allows you to interpret the evaluation and compare different experiments.
The experiment manager also provides users the ability to initiate a chain of experiments, with each successive experiment tracking the changes in the agent from its parent experiment. This allows users to iteratively develop their agents and compare evaluations performed on the same agent across versions. The metadata of each experiment contains code difference logs, averaged metrics, and configurations that can be used to analyze whether a change improved the agentic system or not.
Picking the right metrics for your agentic system
When evaluating an agentic system, selecting the right metrics is critical to accurately assess its performance. The choice of metric directly influences how you interpret the outcome and make iterative improvements. For AI systems, metrics should be chosen based on the following considerations:
System type
- RAG systems: Focus on retrieval metrics (precision, recall) and generation metrics (faithfulness, answer correctness)
- Agentic systems: Prioritize tool call accuracy, logical coherence, and answer faithfulness
Use case requirements
- Question answering: Emphasize answer correctness and relevancy
- Information retrieval: Focus on context precision and recall
- Reasoning tasks: Prioritize logical coherence and faithfulness
Technical considerations
- Computation cost: Embedding-based metrics are heavier than token-based ones
- API dependencies: LLM-as-judge metrics require API access
- Batch processing: Some metrics support efficient batch evaluation
To make the above process easier for you, these are five metrics that provide the best overview of how well your agentic system performs (see the sketch after this list):
- Tool call accuracy: Evaluates whether the agent uses the right tools with the right parameters.
- Tool accuracy: Compares tool outputs with ground truth tool outputs. Measures how accurate the tools are.
- Agent response correctness: Evaluates the correctness of agent responses compared to ground truth. Measures the quality of the overall agent response.
- Logical coherence: Assesses the logical flow and reasoning in agent responses; helps to analyze the chain of command between agents in a system and how well the agents work with each other to answer the user query.
- Answer faithfulness: Checks if the agent’s response is consistent with the tool outputs fetched by the agent.
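For illustration, here is a minimal sketch of the first of these checks: a strict exact-match version of tool call accuracy, computed from the fields shown in the earlier result sample. Production implementations may allow partial or order-insensitive matches; this simplification is an assumption.

# Minimal sketch of a tool call accuracy check: exact-match comparison of the
# agent's tool calls against reference tool calls.
def tool_call_accuracy(agent_tool_calls, reference_tool_calls):
    if not reference_tool_calls:
        return 1.0 if not agent_tool_calls else 0.0
    matches = 0
    for expected, actual in zip(reference_tool_calls, agent_tool_calls):
        if expected["name"] == actual["name"] and expected["args"] == actual["args"]:
            matches += 1
    return matches / len(reference_tool_calls)

# Using the sample data instance shown earlier:
agent_calls = [{"name": "get_metal_price", "args": {"metal_name": "copper"}}]
reference_calls = [{"name": "get_metal_price", "args": {"metal_name": "copper"}}]
print(tool_call_accuracy(agent_calls, reference_calls))  # 1.0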
The final stage is analysis: aggregating results, computing metrics, and, crucially, understanding where and why the agent failed. This is where a lack of standardization and automation in the earlier steps comes back to haunt the developer: debugging mismatches, tracing errors back to their source, and iterating on the agent or the data become slow and error-prone.
Digging deeper: interpreting results effectively
Now that you’ve selected the appropriate metrics, run the experiment, and obtained the results, what’s next? How do you make sense of the seemingly random numbers you’ve collected? Interpreting these results is a crucial step in understanding the behavior and performance of your system. There are several effective strategies to analyze the output, uncover insights, and assess the impact of the changes you’ve made. This step transforms raw metrics into actionable knowledge about your agent’s strengths, weaknesses, and areas for improvement.
Comparative analysis techniques
When comparing two different agent implementations or versions:
Side-by-side metric comparison
- Average each metric across all test cases for both agents
- Calculate the relative improvement between systems (e.g., “Agent B shows 12% improvement in tool call accuracy over Agent A”)
- Use radar charts to visualize the multi-dimensional performance landscape
- Identify complementary strengths (e.g., “Agent A excels at tool selection while Agent B produces more faithful responses”)
Paired analysis
- Compare performance on identical queries to identify systematic differences
- Calculate the percentage of queries where one agent outperforms the other
- Identify query types where performance differences are most pronounced
Example interpretation: “While Agent B has higher average tool call accuracy (0.87 vs 0.79), Agent A performs better on complex multi-step reasoning tasks, suggesting Agent B might use simpler but more reliable patterns.”
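As a rough illustration, a paired comparison like this can be done in a few lines of pandas, assuming both agents were evaluated on the same queries and the exported result files follow the format shown earlier (the file names and DataFrame layout here are assumptions):

# Sketch of a paired comparison between two agents evaluated on the same
# queries. Column names (user_input, tool_call_accuracy) follow the result
# format shown earlier; the file names are placeholders.
import pandas as pd

results_a = pd.read_json("experiment_agent_a.json")
results_b = pd.read_json("experiment_agent_b.json")

paired = results_a.merge(results_b, on="user_input", suffixes=("_a", "_b"))

# Average improvement and per-query win rate for one metric
delta = paired["tool_call_accuracy_b"] - paired["tool_call_accuracy_a"]
print(f"Mean improvement: {delta.mean():+.3f}")
print(f"B outperforms A on {(delta > 0).mean():.0%} of queries")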
Distribution analysis approaches
Understanding the distribution of metric scores provides deeper insight than averages alone:
Histogram analysis
- Plot the distribution of scores for each metric
- Identify whether performance follows normal distribution or shows clustering/bimodality
- Compare the spread (variance) between different implementations
Quartile analysis
- Examine the 25th, 50th (median), and 75th percentiles
- A large gap between median and 75th percentile indicates inconsistent performance
- Focus improvement efforts on raising the bottom quartile
Example interpretation: “Agent A’s tool call accuracy has a bimodal distribution with peaks at 0.4 and 0.9, indicating it performs very well on some query types but struggles significantly on others. Agent B shows a narrower distribution centered at 0.75, indicating more consistent but less exceptional performance.”
Threshold-based analysis
Setting performance thresholds helps quantify success rates:
Success rate calculation
- Define acceptable thresholds for each metric (e.g., tool call accuracy > 0.85)
- Calculate the percentage of samples exceeding each threshold
- Identify which thresholds are most challenging to meet
Multi-criterion analysis
- Define success as meeting thresholds across multiple metrics simultaneously
- Calculate the percentage of samples meeting all criteria
- Identify the most common failing points
Example interpretation: “While 78% of Agent A’s responses meet our tool accuracy threshold of 0.9, only 62% simultaneously meet our answer faithfulness threshold of 0.85. This suggests the agent occasionally produces correct outputs through incorrect reasoning paths.”
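As a sketch, both the per-metric and multi-criterion success rates can be computed directly from an exported results file. The metric columns below follow the sample result shown earlier; the thresholds and file name are illustrative assumptions:

# Sketch of threshold-based and multi-criterion success rates over a results
# DataFrame; metric names follow the sample result shown earlier.
import pandas as pd

results = pd.read_json("experiment_agent_a.json")

thresholds = {
    "agent_response_correctness": 0.90,  # illustrative threshold
    "answer_faithfulness": 0.85,         # illustrative threshold
}

# Per-metric success rate: share of samples meeting each threshold
for metric, threshold in thresholds.items():
    rate = (results[metric] >= threshold).mean()
    print(f"{metric}: {rate:.0%} of samples >= {threshold}")

# Multi-criterion success: samples that clear every threshold at once
meets_all = pd.Series(True, index=results.index)
for metric, threshold in thresholds.items():
    meets_all &= results[metric] >= threshold
print(f"All criteria met: {meets_all.mean():.0%}")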
By applying these analytical approaches to agentic metrics, you can develop a reasonable understanding of your system’s performance, make improvement decisions, and establish reliable quality standards for deployment. This systematic analysis helps move beyond simple metrics to truly understand the agent’s capabilities and limitations in different contexts.
Example: evaluating a data analysis agent
We tested this framework on a conversational agent designed to answer queries based on data stored in Couchbase. The agent takes in user questions, forwards each question to an NL2SQL++ tool that generates SQL++ queries to fetch the corresponding documents from the store, and then generates a detailed answer and an analysis report for the user question using the retrieved documents.
For our evaluation, the agent was run on an AirBNB listings dataset that contains the details of AirBNB listings across the United States. Questions and reference answers were generated on the data instances using the synthetic data generator. Below is a sample set of query-answer pairs produced by the generator:
[
  {
    "question": "What type of room is offered in the \"Clean and quiet apt home by the park\"?",
    "answer": "The room type offered in the \"Clean and quiet apt home by the park\" is a Private room. This conclusion is based on the data retrieved from the Airbnb listings, where the specific entry for this listing name was queried to determine the type of accommodation provided. The data clearly indicates that the listing is categorized under the \"Private room\" type, meaning guests will have a private space within a shared property."
  },
  {
    "question": "What is the cancellation policy for the \"Skylit Midtown Castle\"?",
    "answer": "The cancellation policy for the \"Skylit Midtown Castle\" is moderate. This conclusion is drawn from the data retrieved from the Airbnb listings, where the specific entry for this listing name was queried to determine the cancellation terms. The data indicates that the listing follows a moderate cancellation policy, which typically allows for more flexibility compared to strict policies, offering guests the ability to cancel within a certain timeframe before the check-in date for a full refund."
  }
]
A set of 40 such query-answer pairs was generated for this particular experiment. The agent was run on these generated queries and the output was logged to create the evaluation dataset.
The evaluation dataset (golden dataset) consists of:
- Questions: Questions generated on the dataset
- Ground truth answers: The reference (correct) answers for the generated questions
- Reference Context: The source of truth from which the queries were generated (Ground truth tool outputs)
- Retrieved Context: The documents retrieved using the NL2SQL++ tool run on the generated queries (tool outputs)
- Agent Responses: The agent’s responses given the query and the retrieved context
An experiment was created on this evaluation dataset using three metrics: semantic similarity, context precision, and answer relevancy. Semantic similarity measures the embedding similarity between the retrieved and reference contexts. Context precision measures how precise the retrieved contexts are with respect to the query and the reference context. Answer relevancy measures how relevant the agent’s response is to the user query and the retrieved context.
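For intuition, an embedding-based semantic similarity score of this kind can be approximated with a sentence transformer and cosine similarity; the model choice below is an assumption, not necessarily what the metric uses internally:

# Rough illustration of an embedding-based semantic similarity score between a
# retrieved context and a reference context; the model choice is an assumption.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

reference_context = "The listing 'Skylit Midtown Castle' has a moderate cancellation policy."
retrieved_context = "Skylit Midtown Castle: cancellation_policy = moderate."

embs = model.encode([reference_context, retrieved_context])
score = util.cos_sim(embs[0], embs[1]).item()
print(f"semantic_similarity = {score:.2f}")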
The average metric scores, along with the experiment metadata for this particular experiment, are provided below. Here, “averaged” refers to the mean of each metric across the data points in the evaluation dataset:
{
  "experiment_id": "experiment5",
  "timestamp": "2025-03-20T11:17:46.411457",
  "llm_model": "gpt-4o",
  "metrics": ["semantic_similarity", "context_precision", "answer_relevancy"],
  "dataset_size": 40,
  "dataset_id": "11b2d36a-4f00-40d2-bbe7-e614f4a77f1f",
  "avg_metrics": {
    "semantic_similarity": 0.85,
    "context_precision": 0.99,
    "answer_relevancy": 0.90
  }
}
A metric quality analysis table can help us analyze how the overall agent performed across the evaluation dataset, using a threshold for each metric and the number of data points that scored above the given threshold.
| S.No | Metric | Threshold | Samples Above Threshold | Total Number of Samples | Accuracy (%) |
|------|--------|-----------|-------------------------|-------------------------|--------------|
| 1 | Semantic Similarity | 0.70 | 40 | 40 | 100.00 |
| 2 | Context Precision | 0.90 | 40 | 40 | 100.00 |
| 3 | Answer Relevancy | 0.70 | 37 | 40 | 92.50 |
In our evaluation, NL2SQL++ consistently demonstrated strong performance, with all 40 test samples achieving both semantic similarity and context precision scores above the predefined threshold. This indicates that the tool reliably captures user intent and accurately translates natural language queries into structured SQL.
The LLM responsible for generating final responses also performed exceptionally well. While 37 out of the 40 responses exceeded the metric threshold, the remaining 3 fell slightly below it. This minor variance is expected, as LLMs inherently generate novel token sequences rather than replicating reference content line for line. Despite these deviations, the model maintained high answer accuracy across the board; had it not, we would have observed more significant metric drops.
Detailed experiment reports are available to inspect individual samples, including those that did not meet the threshold, providing insight into how much they deviated and the potential reasons why.
Conclusion
This framework has been built with the core requirement of providing users with a persistent method to evaluate AI systems across domains and use cases, making evaluation simpler by using a consistent format for data, automating data handling, and creating example scenarios where real ones are hard to collect. It also tracks every step an agent takes, which helps when multiple agents work together. In the future, we expect these tools to keep improving by updating test cases as real-world needs change, generating easy-to-understand reports for everyone, and including checks to catch bias or unsafe behavior. By doing this, we can make sure that AI agents stay reliable, transparent, and, most importantly, aligned with the developers’ needs.