Root Signals Product Documentation
Objectives

Objectives consist of a human-readable Intent and ground truth examples. An Objective serves two purposes:

  • Communication: Expressing the intended business purpose of the evaluator

  • Coordination: Serving as a battery of measures

Cookbook

Advanced use cases and common recipes

Intro

What is it?

Root Signals is a measurement, observability, and control platform for GenAI applications, automations, and agentic workflows powered by Large Language Models (LLMs). Such applications include chatbots, Retrieval Augmented Generation (RAG) systems, agents, data extractors, summarizers, translators, AI assistants, and various automations powered by LLMs.


Key Features

Any developer can use Root Signals to:

  • Add appropriate metrics such as Truthfulness, Answer Relevance, or Coherence of responses to any LLM pipeline and optimize their design choices (which LLM to use, prompt, RAG hyper-parameters, etc.) using these measurements:

    • Log, record, and compare changes and the measurements corresponding to them

    • Integrate metrics into CI/CD pipelines (e.g. GitHub Actions) to prevent regressions (see the sketch after this list)

  • Turn those metrics into guardrails that prevent wrong, inappropriate, undesirable, or in general sub-optimal behaviour of their LLM apps simply by adding trigger thresholds. Monitor the performance in real-time, in production.

  • Create custom metrics for attributes ranging from 'mention of politics' to 'adherence to our communications policy document v3.1'.
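For the CI/CD case above, here is a minimal sketch of what such a regression gate could look like: it scores a fixed set of request/response pairs with the Relevance evaluator (following the SDK call pattern from the Quick Start) and fails the job if the average score drops below a threshold. The sample pairs and the 0.7 threshold are illustrative assumptions, not recommendations.

from root import RootSignals

# Connect to the Root Signals API (API key configured as in the Quick Start)
client = RootSignals()

# Illustrative regression samples: (request, candidate response) pairs
samples = [
    ("What is your return policy?", "You can return items within 30 days of purchase."),
    ("Where are the setup instructions?", "You can find the instructions from our Careers page."),
]

MIN_AVG_SCORE = 0.7  # assumed quality bar for this pipeline

scores = [
    client.evaluators.Relevance(request=request, response=response).score
    for request, response in samples
]

average = sum(scores) / len(scores)
# Fail the CI job if quality regresses below the agreed threshold
assert average >= MIN_AVG_SCORE, f"Average evaluator score {average:.2f} is below {MIN_AVG_SCORE}"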

Dashboard

The dashboard provides a comprehensive overview of the performance of your specific LLM applications:

Main dashboard

Ready-Made Evaluators

Root Signals provides 30+ built-in, ready-to-use evaluators called Root Evaluators.

Root evaluators

Custom Evaluators

Utilizing any LLM as a judge, you can create, benchmark, and tune Custom Evaluators.

Custom evaluator for policy adherence

Monitoring

We provide complete observability to your LLM applications through our Monitoring view.

Monitoring overview
Detailed metrics and trends for your specific applications

Using Root Signals

Root Signals is available via

  • 🖥️ Web UI

  • SDKs

    • 🐍 Python SDK and Root Proxy

      • GitHub repo

    • TypeScript SDK

  • 📑 REST API

  • 🔌 Model Context Protocol (MCP) Server (for Agents)

Root Signals can be used by individuals and organizations. Role-based Access Controls (RBAC), SLA, and security definitions are available for organizations. Enterprise customers also enjoy SSO signups via Okta and SAML.

Create a free Root Signals account and get started in 30 seconds.

Models

Models are the actual source of the intelligence. A model generally refers to the type of model (such as GPT), the provider of the model (such as Azure), and the specific variant (such as GPT-4o). The models available on the Root Signals platform consist of:

  • Proprietary and hosted open-source models accessible via API. These models can be accessed via your API key or the Root Signals platform key (with corresponding billing responsibilities).

  • Open-source models provided by Root Signals.

  • Models added by your organization. See the Connect a model section for details.

Control & Compliance

Some model providers are GDPR-compliant, ensuring data processing meets the General Data Protection Regulation requirements. However, please note that GDPR compliance by the provider does not necessarily mean that data is processed within the EU.

The organization admin can control the API keys and restrict access to a specific subset of models.


Dataset permissions

Datasets in Root Signals contain static information that can be included as context for skill execution. They can contain information about your organization, products, customers, etc. Datasets are linked to skills using reference variables.

Access to datasets is controlled through permissions. By default, when a user uploads a new dataset, it is set to 'unlisted' status. Unlisted datasets are only visible to the user who created them and to administrators in the organization. This allows users to work on datasets privately until they are ready to be shared with others.

To make a dataset available to other users in the organization, the dataset owner or an administrator needs to change the status to 'listed'. Listed datasets are visible to all users in the organization and can be used in skills by anyone.

Dataset permissions do not control skill execution privilege

Note that dataset permissions control whether a dataset can be used in skill creation or skill editing as a reference variable or as a test dataset. Unless more specific permission information is made available via enterprise integrations, dataset permissions do not control who can use the dataset in skill execution. That is, once a dataset is attached to a skill as a reference variable, anyone who has privileges to execute the skill also has implicit access to the dataset through the skill execution.

It is important for dataset owners and administrators to carefully consider the sensitivity and relevance of datasets before making them widely available. Datasets may contain confidential or proprietary information that should only be accessible to authorized users.

Contact Root Signals for more fine-grained controls in enterprise, regulated or governmental contexts.

In summary

The dataset permission system in Root Signals allows for granular control over who can access and use specific datasets. The unlisted/listed status toggle and the special privileges granted to administrators provide flexibility in managing data assets across the organization. Proper management of dataset permissions is crucial for ensuring data security and relevance in skill development and execution.

Execution, Auditability and Versioning

The requests to any models wrapped within skill objects, and their responses, are traceable via the log objects of the Root Signals platform.

The retention of logs is determined by your platform license. You may export logs at any point for your local storage. Access to execution logs is restricted based on your user role and skill-specific access permissions.

Objectives, evaluators, skills and test datasets are strictly versioned. The version history allows keeping track of all local changes that could affect the execution.

To understand reproducibility of pipelines of generative models, these general principles hold:

  • For any models, we can control for the exact inputs to the model, record the responses received, and the evaluator results of each run.

  • For open source models, we can pinpoint the exact version of the model (weights) being used, if this is guaranteed by the model provider, or if the provider is Root Signals.

  • For proprietary models whose weights are not available, we can pinpoint the version based on the version information given by the providers (such as gpt-4-turbo-2024-04-09), but we cannot guarantee those models are, in reality, fully immutable.

  • Any LLM request with a 'temperature' parameter above 0 should not be expected to be deterministic. Temperature = 0 and/or a fixed value of a 'seed' parameter usually means the result is deterministic, but your mileage may vary (a minimal sketch of pinning these parameters follows this list).
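As an illustration of the last two points, and assuming the operational model is called through the standard OpenAI Python client (this is not a Root Signals API), the sketch below pins a dated model version, sets temperature to 0, and fixes a seed; the prompt and seed value are arbitrary.

# A sketch of pinning the controllable sources of variation in an LLM request.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4-turbo-2024-04-09",  # dated version string, as discussed above
    messages=[{"role": "user", "content": "Summarize our Q1 results in one sentence."}],
    temperature=0,  # removes sampling randomness
    seed=42,        # fixes the sampler seed where the provider supports it
)
print(response.choices[0].message.content)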

Access Controls & Roles

Access to all Root Signals entities is controlled via user roles as well as entity-specific settings by the entity creator. The fundamental roles are the User and the Administrator.

Administrator privileges

Administrators have additional entity management privileges when it comes to management of datasets, objectives, skills and evaluators:

  1. Administrators can see all the entities in the organization, including unlisted ones. This allows them to have an overview of all the data and functional assets in the organization.

  2. Administrators can change the status of any entity such as a dataset from unlisted to listed and vice versa. This enables them to control which entities are shared with the wider organization.

  3. Administrators can delete any entity, regardless of who created it. This is useful for managing obsolete or irrelevant entities.

Administrators also control the accessibility of models across the organization, as well as users and billing.

Use evaluators and RAG

Root Signals provides evaluators for RAG use cases, where you can give the context as part of the evaluated content.

One such evaluator is the Truthfulness evaluator, which measures the factual consistency of the generated answer against the given context and general knowledge.

Here is an example of running the Truthfulness evaluator using the Python SDK. Pass the context used to get the LLM response in the contexts parameter.

from root import RootSignals

# Connect to the Root Signals API
client = RootSignals()

result = client.evaluators.Truthfulness(
    request="What was the revenue in Q1/2023",
    response="The revenue in the last quarter was 5.2 M USD",
    contexts=[
        "Financial statement of 2023",
        "2023 revenue and expenses...",
    ],
)
print(result.score)
# 0.5

Integrations

Both our Skills and Evaluators may be used as custom-generator LLMs in 3rd-party frameworks, and we are committed to supporting an OpenAI ChatResponse-compatible API.

Note, however, that additional functionality, such as validation results, calibration, etc., is not available as part of OpenAI responses and requires you to implement additional code if anything besides failing on unsuccessful validation is required.

Advanced use-cases can rely on referencing the completion.id returned by our API as unique identifier for downstream tasks. Please refer to the Cookbook section for details.
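As a rough sketch of that pattern, assuming the OpenAI-compatible Judge endpoint described in the Judges section (the judge ID and API key below are placeholders), the completion id can be stored alongside downstream records:

# pip install openai
from openai import OpenAI

client = OpenAI(
    api_key="$MY_API_KEY",
    base_url="https://api.app.rootsignals.ai/v1/judges/$MY_JUDGE_ID/openai/",
)

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "I want to return my product"}],
)

# completion.id uniquely identifies this call; attach it to downstream records
# (tickets, traces, analytics events) so results can be looked up later.
print(completion.id)
print(completion.choices[0].message.content)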

Breaking Change Policy

We adhere to Semantic Versioning (SemVer) principles to manage the versions of our software products effectively. This ensures clarity and predictability in how updates and changes are handled.

Communication of Breaking Changes

  1. Notification: All breaking changes are communicated to stakeholders via email. These notifications provide details about the nature of the change, the reasons behind it, and guidance on how to adapt to these changes.

  2. Versioning: When a breaking change is introduced, the major version number of the software is incremented. For example, an upgrade from version 1.4.5 to 2.0.0 indicates the introduction of changes that may disrupt existing workflows or dependencies.

  3. Documentation: Each major release accompanied by breaking changes includes updated documentation that highlights these changes and provides comprehensive migration instructions to assist in transitioning smoothly.

Lifecycle Management

In Root Signals, evaluation is treated as a procedure to compute a metric grounded on a human-defined criteria, emphasizing the separation of utility grounding (Objective) and implementation (Evaluator function).

This lets the criteria and implementations for the evaluations evolve in two separate controlled and trackable tracks, each with different version control logic.

Metric evaluators are different from other entities in the world, and simply treating them as "grounded in data", on one hand, or as "tests", on the other, misses some of their core properties.

In Root Signals, an Objective consists of

  • An Intent that is human-defined and human-understandable, corresponding to the precise attribute being measured.

  • A Calibration dataset that defines, via examples, the structure and scale of those criteria.

An Evaluator function consists of:

  • Predicate that uniquely specifies the task to the LLMs that power the evaluator

  • LLM

  • In-context examples (demonstrations)

  • [Optionally] Associated data files

An Evaluator function is typically associated with an Objective that connects it to business / contextual value, but the two have no causal connection.

Root Signals platform itself handles:

  • Semantic quantization: Guaranteeing the predicates are consistently mapped to metrics (for supported LLMs). This lets us abstract the predicates out of the boilerplate prompts needed to yield robust metrics

  • Version control of evaluator implementations

  • Maintenance of relationships between Objectives and Evaluators

  • Monitoring

If an Objective is changed (e.g. its calibration dataset is altered), it is not a priori clear whether the related criteria have changed; such a change would affect all evaluator variants using the Objective and render their measurements backwards-incompatible. Hence, the best practice enforced by the Root Signals platform is to create an entirely new Objective, so that it is clear the criteria have changed. This can be bypassed, however, when the Objective is still in its formation stage and/or you accept that the criteria will change over time.

Over time, improved evaluator functions will be created (including but not limited to model updates) to improve upon the Objective targets. On the other hand, Objectives tend to branch and become more precise over time, passing the burden of resolving the question of "is this still the same Objective" to the users, while providing the software support to make those calls either way in an auditable and controllable manner.

Getting started in 30 seconds

Scorable

Scorable is the automated LLM Evaluation Engineer agent for co-managing the Root Signals platform with you.

From the App

To get started, sign up and log in to the Root Signals app. Select an evaluator under the Evaluators tab and click Execute. You will get a score between 0 and 1 and a justification for the score.

Programmatically

Create your Root Signals API key under Settings > Developer.

In Python

pip install root-signals

Quick test

For the best experience, we encourage you to create an account. However, if you prefer to run quick tests at this point, please create a temporary API key here.

Root Signals provides over 30 evaluators or judges, which you can use to score any text based on a wealth of metrics. You can attach evaluators to an existing application with just a few lines of code.

from root import RootSignals

# Just a quick test?
# You can get a temporary API key from https://app.rootsignals.ai/demo-user 
client = RootSignals(api_key="my-developer-key")
client.evaluators.Politeness(
    response="You can find the instructions from our Careers page."
)
# {score=0.7, justification='The response is st...', execution_log_id=...}
  • Python SDK Docs

  • Python SDK GitHub Repo

In Typescript

npm install @root-signals/typescript-sdk
# or
yarn add @root-signals/typescript-sdk
# or  
pnpm add @root-signals/typescript-sdk

and execute:

import { RootSignals } from '@root-signals/typescript-sdk';

// Connect to Root Signals API
const client = new RootSignals({
  apiKey: process.env.ROOTSIGNALS_API_KEY!
});

// Run any of our ready-made evaluators
const result = await client.evaluators.executeByName('Helpfulness', {
  response: "You can find the instructions from our Careers page."
});
  • Typescript SDK GitHub Repo

Via REST API

You can execute evaluators in your favourite framework and tech stack via our REST API:
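For example, here is a minimal sketch using Python's requests library against the evaluator execution endpoint that also appears in the cURL examples later in these docs; the evaluator ID and API key are placeholders you copy from the app.

import requests

API_KEY = "<YOUR API KEY>"
EVALUATOR_ID = "<EVALUATOR_ID>"  # placeholder; copy it from the evaluator's page in the app

response = requests.post(
    f"https://api.app.rootsignals.ai/v1/skills/evaluator/execute/{EVALUATOR_ID}/",
    headers={
        "authorization": f"Api-Key {API_KEY}",
        "content-type": "application/json",
    },
    json={"response": "You can find the instructions from our Careers page."},
)
response.raise_for_status()
print(response.json())  # contains the score and the justification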

Root Signals MCP Server (for Agents)

Our Model Context Protocol (MCP) equips your AI agents with evaluation capabilities.

  • Root Signals MCP Repo

Concepts

Root Signals' design philosophy starts from the principle of extreme semantic rigor. Briefly, this means making sure that, for example:

  • The definitions and references of entities are tracked with maximal (and increasing) precision

  • Entities are assumed long-term and upgradeable

  • Entities are built for re-use

  • Changes will be auditable

Objective defines what you intend to achieve. It grounds an AI automation to the business target, such as providing a feature ('transform data source X into usable format Y') or a value ('suitability for use in Z').

Evaluator is a function that assigns a numeric value to a piece of content such as text, along a semantically defined dimension (truthfulness, relevance of an answer, coherence, etc.).

An example ready-made evaluator

Judge is a stack of evaluators with a high-level Intent.

An example Judge that contains several evaluators

Model is the AI model such as an LLM that provides the semantic processing of the inputs. Notably, the list contains both API-based models such as OpenAI and Anthropic models, and open source models such as Llama and Mistral models. Finally, you can add your own locally running models to the list with ease. The organization Admin controls the availability of models enabled in your organization.

Add a calibration test set

To ensure the reliability of the Direct Language evaluator, you can create and use test data, referred to as a calibration dataset. A calibration set is a collection of LLM outputs, prompts, and expected scores that serve as benchmarks for evaluator performance.


1. Attaching a Calibration Set

Start by attaching an empty calibration set to the evaluator:

  1. Navigate to the Direct Language evaluator page and click Edit.

  2. Select the Calibration section and click Add Dataset.

  3. Name the dataset (e.g., “Direct Language Calibration Set”).

  4. Optionally, add sample rows, such as:

    "0,2","I am pretty sure that is what we need to do"
  5. Click Save and close the dataset editor.

  6. Optionally, click the Calibrate button to run the calibration set.

  7. Save the evaluator


2. Adding Production Samples to the Calibration Set

You can enhance your calibration set using real-world data from evaluator runs stored in the execution log.

  1. Go to the Execution Logs page.

  2. Locate a relevant evaluator run and click on it.

  3. Click Add to Calibration Dataset to include its output and score in the calibration set.

By regularly updating and running the calibration set, you safeguard the evaluator against unexpected behavior, ensuring its continued accuracy and reliability.

Evaluator Portfolio

The list of well-calibrated Root evaluators includes:

  • Relevance

  • Safety for Children

  • Sentiment Recognition

  • Coherence

  • Conciseness

  • Engagingness

  • Originality

  • Clarity

  • Precision

  • Persuasiveness

  • Confidentiality

  • Harmlessness

  • Formality

  • Politeness

  • Helpfulness

  • Non-toxicity

  • Faithfulness RAG Evaluator

  • Faithfulness-swift RAG Evaluator

  • Answer Relevance

  • Truthfulness RAG Evaluator

  • Truthfulness-swift RAG Evaluator

  • Quality of Writing - Professional

  • Quality of Writing - Creative

  • JSON Content Accuracy RAG Evaluator | Function Call Evaluator

  • JSON Property Completeness Function Call Evaluator

  • JSON Property Type Accuracy Function Call Evaluator

  • JSON Property Name Accuracy Function Call Evaluator

  • JSON Empty Values Ratio Function Call Evaluator

  • Answer Semantic Similarity Ground Truth Evaluator

  • Answer Correctness Ground Truth Evaluator

  • Context Recall RAG Evaluator | Ground Truth Evaluator

  • Context Precision RAG Evaluator | Ground Truth Evaluator

  • Summarization Quality

  • Translation Quality

  • Information Density

  • Reading Ease

  • Planning Efficiency

  • Answer Willingness

Details of each evaluator can be found in the evaluator documentation.

Judges

Judges are stacks of evaluators with their own high-level intent.

You can see the overview of your Judges in the app:

You can inspect a Judge in detail as well:

Via OpenAI-compatible Endpoint

Judges can also be called through an OpenAI-compatible chat completions endpoint:

# pip install openai
from openai import OpenAI


client = OpenAI(
    api_key="$MY_API_KEY",
    base_url="https://api.app.rootsignals.ai/v1/judges/$MY_JUDGE_ID/openai/"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "I want to return my product"}
    ]
)

print(f"Assistant's response: {response.choices[0].message.content}")
print(f"Judge evaluation results: {response.model_extra.get('evaluator_results')}")

You can also run a Judge directly via the REST API or the Python SDK:

cURL

curl 'https://api.app.rootsignals.ai/v1/judges/$MY_JUDGE_ID/execute/' \
-H 'authorization: Api-Key $MY_API_KEY' \
-H 'content-type: application/json' \
--data-raw '{"response":"LLM said: You can return the item within 30 days of purchase, and we will refund the full amount...","request":"I want to return my product"}'

Python

# pip install root-signals
from root import RootSignals

client = RootSignals(api_key="$MY_API_KEY")
result = client.judges.run(
    judge_id="$MY_JUDGE_ID",
    response="LLM said: You can return the item within 30 days of purchase, and we will refund the full amount...",
    request="I want to return my product"
)
print(f"Run results: {result.evaluator_results}")
# Score (a float between 0 and 1): {result.evaluator_results[0].score}
# Justification for the score: {result.evaluator_results[0].justification}

Add a custom evaluator

Root Signals provides evaluators that fit most needs, but you can add custom evaluators for specific needs. In this guide, we will add a custom evaluator and tune its performance using demonstrations.

Example: Weasel words

Consider a use case where you need to evaluate a text based on its number of weasel words or ambiguous phrases. Root Signals provides the optimized Precision evaluator for this, but let's build something similar to go through the evaluator-building process.

  1. Navigate to the Evaluator Page:

    • Go to the evaluator page and click on "New Evaluator."

  2. Name Your Evaluator:

    • Type the name for the evaluator, for example, "Direct language."

  3. Define the Intent:

    • Give the evaluator an intent, such as "Ensures the text does not contain weasel words."

  4. Create the Prompt:

    • "Is the following text clear and has no weasel words"

  5. Add a placeholder (variable) for the text to evaluate:

    • Click on the "Add Variable" button to add a placeholder for the text to evaluate.

      • E.g., "Is the following text clear and has no weasel words: {{response}}"

  6. Select the Model:

    • Choose the model, such as gpt-4-turbo, for this evaluation.

  7. Save and Test the Evaluator:

    • Click Create evaluator and begin experimenting with it.

Improve the custom evaluator performance

You can add demonstrations to the evaluator to tune its scores to match more closely to the desired behavior.

Example: Improve the Weasel words evaluator

Let's penalize using the word "probably"

  1. Go to the Weasel words evaluator and click Edit

  2. Click Add under Demonstrations section

  3. Add a demonstration

    • Type to the Response field: "This solution will probably work for most users."

    • Score: 0,1

  4. Save the evaluator and try it out

Note that adding more demonstrations, such as

  • "The project will probably be completed on time."

  • "We probably won't need to make any major changes."

  • "He probably knows the answer to your question."

  • "There will probably be a meeting tomorrow."

  • "It will probably rain later today."

will further adjust the evaluator's behavior. Refer to the full evaluator documentation for more information.

Evaluate an LLM response

Building production-ready and reliable AI applications requires safeguards provided by an evaluation layer. LLM responses can vary drastically based on even the slightest input changes.

Root Signals provides a robust set of fundamental evaluators suitable for any LLM-based application.

Setup

You need a few examples of LLM outputs (text). Those can be from any source, such as a summarization output on a given topic.

Running an evaluator through the UI

The evaluators listing page shows all evaluators at your disposal. Root Signals provides the base evaluators, but you can also build custom evaluators for specific needs.

Let's start with the Precision evaluator. Based on the text you want to evaluate, feel free to try other evaluators as well.

  1. Click on the Precision evaluator and then click on the Execute skill button.

  2. Paste the text you want to evaluate into the output field and click Execute. You will get a numeric score based on the metric the evaluator is evaluating and the text to evaluate.

An individual score is not very interesting. The power of evaluation lies in integrating evaluators into an LLM application.

Integrating evaluators as part of existing AI automation

Integrating the evaluators as part of your LLM application is a more systematic approach to evaluating LLM outputs. That way, you can compare the scores over time and take action based on the evaluation results.

The Precision evaluator details page contains information on how to add it to your application. First, you must fetch a Root Signals API key and then execute the example cURL command.

  1. Go to the Precision evaluator details page

  2. Click on the Add to your application link

  3. Copy the cURL command

You can omit the request field from the data payload and add the text to evaluate in the response field.

Example (cURL)

curl 'https://api.app.rootsignals.ai/v1/skills/evaluator/execute/767bdd49-5f8c-48ca-8324-dfd6be7f8a79/' \
  -H 'authorization: Api-Key <YOUR API KEY>' \
  -H 'content-type: application/json' \
  --data-raw '{"response":"While large language models (LLMs) have many powerful applications, there are scenarios where they are not as effective or suitable. Here are some use cases where LLMs may not be useful:\n\nReal-Time Critical Systems:\nLLMs are not ideal for applications requiring real-time, critical decision-making, such as air traffic control, medical emergency systems, or autonomous vehicle navigation, where delays or errors can have severe consequences.\n\nHighly Specialized Expert Tasks:\nTasks that require deep domain-specific expertise, such as advanced scientific research, complex legal analysis, or detailed medical diagnosis, may be beyond the capabilities of LLMs due to the need for precise, highly specialized knowledge and judgment."}'

Example (Python SDK)

# pip install root-signals
from root import RootSignals

client = RootSignals(api_key="<YOUR API KEY>")
client.evaluators.Precision(
    response="While large language models (LLMs) have many powerful applications, there are scenarios where they are not as effective or suitable. Here are some use cases where LLMs may not be useful:\n\nReal-Time Critical Systems:\nLLMs are not ideal for applications requiring real-time, critical decision-making, such as air traffic control, medical emergency systems, or autonomous vehicle navigation, where delays or errors can have severe consequences.\n\nHighly Specialized Expert Tasks:\nTasks that require deep domain-specific expertise, such as advanced scientific research, complex legal analysis, or detailed medical diagnosis, may be beyond the capabilities of LLMs due to the need for precise, highly specialized knowledge and judgment."
)

Evaluator result 0.85 for the text summarizing LLM use-cases.

Connect a model

A Root Signals subscription provides a set of models you can use in any of your skills. You are not limited to that selection, though. Integrating with cloud providers' models or connecting to locally hosted models is possible through the SDK or the REST API.

For a complete list of available models, please get in touch with Root Signals support.

Huggingface example

To use an HF inference endpoint, add the model endpoint via the SDK or through the REST API, as shown below. After adding the model, you can use it like any other model in your skills and evaluators.

curl --request POST \
     --url https://api.app.rootsignals.ai/v1/models/ \
     --header 'Authorization: Api-Key $ROOT_SIGNALS_API_KEY' \
     --header 'accept: application/json' \
     --header 'content-type: application/json' \
     --data '
               {
                 "name": "huggingface/meta-llama/Meta-Llama-3-8B",
                 "url": "https://my-endpoint.huggingface.cloud",
                 "default_key": "$HF_KEY"
               }
            '
skill = client.skills.create(
    name="My model test", prompt="Hello, my model!", model="huggingface/meta-llama/Meta-Llama-3-8B"
)

Langfuse

Example requires langfuse v3.0.0

import os
from langfuse import observe, get_client
from openai import OpenAI
from root import RootSignals

# Initialize Langfuse client using environment variables
# LANGFUSE_SECRET_KEY, LANGFUSE_PUBLIC_KEY, LANGFUSE_HOST
langfuse = get_client()

# Initialize RootSignals client
rs = RootSignals()

# OpenAI client and prompt template used by the generation step below
# (assumed definitions; adapt to your own LLM call and prompt)
client = OpenAI()
prompt_template = "Explain the following concept concisely: {question}"

Integration

@observe(name="explain_concept_generation")  # Name for traces in Langfuse UI
def explain_concept(topic: str) -> tuple[str | None, str | None]:  # Returns content and trace_id
    # Get the trace_id for the current operation, created by @observe
    current_trace_id = langfuse.get_current_trace_id()

    prompt = prompt_template.format(question=topic)
    response_obj = client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        model="gpt-4",
    )
    content = response_obj.choices[0].message.content
    return content, current_trace_id
def evaluate_concept(request: str, response: str, trace_id: str) -> None:

    # Invoke a specific Root Signals judge
    result = rs.judges.run(
        judge_id="4d369224-dcfa-45e9-939d-075fa1dad99e", 
        request=request,  # The input/prompt provided to the LLM
        response=response, # The LLM's output to be evaluated
    )

    # Iterate through evaluation results and log them as Langfuse scores
    for eval_result in result.evaluator_results:
        langfuse.create_score(
            trace_id=trace_id,                   # Links score to the specific Langfuse trace
            name=eval_result.evaluator_name,     # Name of the Root Signals evaluator (e.g., "Truthfulness")
            value=eval_result.score,             # Numerical score from the evaluator
            comment=eval_result.justification,   # Explanation for the score
        )
       
         
    

Table: Mapping Root Signals Output to Langfuse Score Parameters

Root Signals    | Langfuse | Description in Langfuse Context
----------------|----------|--------------------------------
evaluator_name  | name     | The name of the evaluation criterion (e.g., "Hallucination," "Conciseness"). Used for identifying and filtering scores.
score           | value    | The numerical score assigned by the Root Signals evaluator.
justification   | comment  | The textual explanation from Root Signals for the score, providing qualitative insight into the evaluation.

Done. Now you can explore detailed traces and metrics in the Langfuse dashboard.

Principles

The Root Signals platform is built upon foundational principles that ensure semantic rigor, measurement accuracy, and operational flexibility. These principles guide the design and implementation of all platform features, from evaluator creation to production deployment.

1. Separation of Concerns: Objectives and Implementations

At the core of Root Signals lies a fundamental distinction between what should be measured and how it is measured:

  • An Objective defines the precise semantic criteria and measurement scale for evaluation.

  • An Evaluator represents an implementation that can meet these criteria.

This separation enables:

  • Multiple evaluator implementations for the same objective

  • Evolution of measurement techniques without changing business requirements

  • Clear communication between stakeholders about evaluation goals

  • Standardized benchmarking across different implementations

In practice, an objective consists of an Intent (describing the purpose and goal) and a Calibrator (the score-annotated dataset providing ground truth examples). The evaluator's function—comprised of prompt, demonstrations, and model—represents just one possible implementation of that objective.

2. Calibration and Measurement Accuracy

Every measurement instrument requires calibration against known standards. In Root Signals, evaluators undergo rigorous calibration to ensure their scores align with human judgment baselines. This process involves:

  • Calibration datasets: Ground truth examples with expected scores, including optional justifications that illustrate the rationale for specific scores

  • Deviation analysis: Quantitative assessment using root mean square error to calculate the total deviation between predicted and expected values

  • Continuous refinement: Iterative improvement based on calibration results, focusing on samples with highest deviation

  • Version control: Tracking evaluator performance across iterations

  • Production feedback loops: Adding real execution samples to calibration sets for ongoing improvement

The calibration principle acknowledges that LLM-based evaluators are probabilistic instruments requiring empirical validation. Calibration samples must be strictly separated from demonstration samples to ensure unbiased measurement.

3. Metric-First Architecture

All evaluations in Root Signals are fundamentally metric evaluations, producing normalized scores between 0 and 1. This universal approach provides:

  • Generalizability: Any evaluation concept can be expressed as a continuous metric

  • Optimization capability: Numeric scores enable gradient-based optimization

  • Fuzzy semantics handling: Real-world concepts exist on spectrums rather than binary states

  • Composability: Metrics can be combined, weighted, and aggregated

This principle recognizes that language and meaning are inherently fuzzy, requiring nuanced measurement approaches. Every evaluator maps text to a numeric value, enabling consistent measurement across diverse dimensions like coherence (logical consistency), conciseness (brevity without information loss), or harmlessness (absence of harmful content).

4. Model Agnosticism and EvalOps

The platform maintains strict independence from specific model implementations, both for operational models (those being evaluated) and judge models (those performing evaluation). This enables:

  • Model comparison: Evaluate multiple models using identical criteria

  • Performance optimization: Select models based on accuracy, cost, and latency trade-offs

  • Future-proofing: Integrate new models as they become available

  • Vendor independence: Avoid lock-in to specific model providers

Changes in either operational or judge models can be measured precisely, enabling data-driven model selection. The platform supports API-based models (OpenAI, Anthropic), open-source models (Llama, Mistral), and custom locally-running models. Organization administrators control model availability, ensuring governance while maintaining flexibility.

5. Interoperability and Portability

Evaluation definitions must transcend platform boundaries through standardized, interchangeable formats. This principle ensures:

  • Clear entity references: Distinguish between evaluator references and definitions

  • Objective portability: Move evaluation criteria between systems

  • Implementation flexibility: Express objectives independent of specific implementations

  • Semantic preservation: Maintain meaning across different contexts

The distinction between referencing an entity and describing it enables robust system integration.

6. Dimensional Decomposition

Complex evaluation predicates can be expressed either as a single (inherently composite) evaluator or decomposed into a vector of multiple independent evaluators that effectively indicate a dimension of measurement. This principle provides:

  • Granular calibration: Each dimension can be independently calibrated

  • Modular development: Evaluators can be developed and tested separately

  • Precise diagnostics: Identify which specific dimensions need improvement

  • Flexible composition: Combine dimensions based on use case requirements

For example, "helpfulness" might decompose into truthfulness, relevance, completeness, and clarity—each with its own evaluator and calibration set. This decomposition extends to specialized domains: RAG evaluators (faithfulness, context recall), structured output evaluators (JSON accuracy, property completeness), and task-specific evaluators (summarization quality, translation accuracy), etc. Judges represent practical implementations of this principle, stacking multiple evaluators to achieve comprehensive assessment.

7. Operational Objectives

Similar to evaluation objectives, an operational task should have an objective that defines its success criteria independent of implementation. An operational objective consists of:

  • Intent: The business purpose of the operation.

  • Success criteria: The set of evaluators that together define acceptable outcomes and what good looks like.

This set of evaluators can be captured in a Judge, while the intent is captured in the Judge's intent description.

  • Implementation independence: Multiple ways to achieve the objective

This principle extends the objective/implementation separation to operational workflows, enabling outcome-based task definition rather than prescriptive implementation.

8. Orthogonality of the Root Evaluator Stack

The Root Evaluators are designed as a set of primitive, orthogonal measurement dimensions that minimize overlap while maximizing coverage. This principle ensures:

  • Minimal redundancy: Each evaluator measures a distinct semantic dimension

  • Maximal composability: Evaluators combine cleanly without interference

  • Complete coverage: The primitive set spans the space of common evaluation needs

  • Predictable composition: Combining evaluators yields intuitive results

This orthogonality enables judges to be constructed as precise combinations of primitive evaluators. For instance, "professional communication quality" might combine:

  • Clarity (information structure)

  • Formality (tone appropriateness)

  • Precision (technical accuracy)

  • Grammar correctness (linguistic quality)

Each dimension contributes independently, allowing fine-grained control over the composite evaluation. The orthogonal design prevents double-counting of features and ensures that improving one dimension doesn't inadvertently degrade another. In cases where one can arguably interpret the evaluator in several different ways, we split these into separate objectives and corresponding Root Evaluators, such as in the case of relevance which may or may not be interpreted to include truthfulness (for instance, in a factual context, an untrue statement is arguably irrelevant, whereas in a story or hypothetical context, this may not be the case).

Practical Implications

These principles manifest throughout the Root Signals platform:

  • Evaluator creation starts with objective definition before implementation

  • Calibration workflows ensure measurement reliability

  • Judge composition allows stacking evaluators for complex assessments

  • Version control tracks both objectives and implementations

  • API design separates concerns between what and how

By adhering to these principles, Root Signals provides a semantically rigorous foundation for AI evaluation that scales from simple metrics to complex operational workflows.

Vertex AI Agent Builder

Integrate Root Signals evaluations with Google Cloud's Vertex AI Agent Builder to monitor and improve your conversational AI agents in real-time.

Architecture Overview

[Vertex AI Agent Builder]
     |
     |—→ [Webhook call (to Cloud Function / Cloud Run)]
                  |
                  |—→ [Root Signals API]
                  |
                  |—→ [Evaluate response]
                  |
           [Log result / augment reply]
                  |
     ←——————— Reply to Agent Builder user

🔧 Step-by-Step Integration

1. Set up a webhook in Vertex AI Agent Builder

  • Go to "Manage Fulfillment" in the Agent Builder UI.

  • Create a webhook (can be a Cloud Function, Cloud Run, or any HTTP endpoint).

  • This webhook will receive request and response pairs from user interactions.


2. Create a middleware endpoint (Cloud Function or Cloud Run)

This endpoint will:

  • Receive user input and the LLM response.

  • Construct an evaluator call to Root Signals API.

  • Send the result back as part of the webhook response (optional).

Option 1: Using Built-in Evaluators

app.post('/evaluate', async (req, res) => {
  const userInput = req.body.sessionInfo.parameters.input;
  const modelResponse = req.body.fulfillmentResponse.messages[0].text.text[0];

  // Use a built-in evaluator (e.g., Relevance)
  const evaluatorPayload = {
    request: userInput,
    response: modelResponse,
  };

  const evaluatorResult = await fetch('https://api.app.rootsignals.ai/v1/skills/evaluator/execute/YOUR_EVALUATOR_ID/', {
    method: 'POST',
    headers: {
      'Authorization': 'Api-Key YOUR_API_KEY',
      'Content-Type': 'application/json',
    },
    body: JSON.stringify(evaluatorPayload),
  });

  const result = await evaluatorResult.json();
  console.log('Evaluator Score:', result.score);

  // Return modified response (if needed)
  res.json({
    fulfillment_response: {
      messages: [
        {
          text: {
            text: [
              `${modelResponse} (Quality score: ${result.score.toFixed(2)})`
            ]
          }
        }
      ]
    }
  });
});

Option 2: Using Custom Judges

app.post('/evaluate', async (req, res) => {
  const userInput = req.body.sessionInfo.parameters.input;
  const modelResponse = req.body.fulfillmentResponse.messages[0].text.text[0];

  // Use a custom judge
  const judgePayload = {
    request: userInput,
    response: modelResponse,
  };

  const judgeResult = await fetch('https://api.app.rootsignals.ai/v1/judges/YOUR_JUDGE_ID/execute/', {
    method: 'POST',
    headers: {
      'Authorization': 'Api-Key YOUR_API_KEY',
      'Content-Type': 'application/json',
    },
    body: JSON.stringify(judgePayload),
  });

  const result = await judgeResult.json();
  console.log('Judge Score:', result.evaluator_results);

  // Return modified response (if needed)
  res.json({
    fulfillment_response: {
      messages: [
        {
          text: {
            text: [
              `${modelResponse} (Judge results: ${JSON.stringify(result.evaluator_results)})`
            ]
          }
        }
      ]
    }
  });
});

3. Configure evaluators and judges

Built-in Evaluators:

  • Use evaluators like Relevance, Precision, Completeness, Clarity, etc.

  • Get available evaluators by logging in to https://app.rootsignals.ai/

  • Examples: Relevance, Truthfulness, Safety, Professional Writing

Custom Judges:

  • Create custom judges that combine multiple evaluators - use https://scorable.rootsignals.ai/ to generate a judge.

  • Judges provide aggregated scoring across multiple criteria

LangChain

Coming Soon!

Frequently Asked Questions

Terminology

What is Intent for?

Intent is the high-level, human-understandable description of the attribute an Evaluator measures. For example: “To measure how clearly the returns handler explains the 20% discount offer on the next purchase”.

What are Datasets?

Datasets allow you to bring test data for benchmarking (Root & Custom) and optimizing (Custom) evaluators.

Behaviour

Does Intent change the behaviour of the evaluator?

No. Evaluator Intent does not alter the evaluator behaviour.

Does Calibration change the behaviour of the evaluator?

No. Calibration is for benchmarking (testing) the evaluators to understand whether they are "calibrated" to your expected/desired behaviour or not. Calibration samples do not alter the behaviour of the evaluators.

How do Demonstrations work?

Demonstrations are used as in-context few-shot samples combined with our well-tuned meta-prompt. They are not utilized for supervised fine-tuning (SFT).

Usage

Our stack is not in Python, can we still use Root Signals?

Absolutely. We have a REST API that you can run from your favourite tech stack.

Do I need to have Calibrations for all Custom Evaluators?

You do not have to bring Calibration samples but we strongly recommend at least a handful of them in order to understand the behaviour of the evaluators.

Can I change the behaviour of the evaluator by bringing labeled data?

You can change the behaviour of your Custom Evaluators by bringing annotated samples as Demonstrations. Behaviour of Root Evaluators can not be altered.

Can I run a previous version of a Custom Evaluator?

Yes.

If we already have a ground truth expected output, can we use your evaluators?

Yes. Various evaluators from us support reference-based evaluations where you can bring your ground truth expected responses. See our evaluator catalogue here.

How can I differentiate evaluations and related statistics for different applications (or versions) of mine?

You can use arbitrary tags for evaluation executions. See the example here.

Can I integrate Root Signals evaluators to experiment tracking tools such as MLflow etc.?

Yes. Our evaluators return a structured response (e.g. a dictionary) with scores, justifications, tags etc. These results can be logged to any experiment tracking system or database similar to any other metric, metadata, or attribute.
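As an illustrative sketch, assuming MLflow as the tracking tool, an evaluator result could be logged like this (the run name and artifact file name are arbitrary):

import mlflow
from root import RootSignals

client = RootSignals()

result = client.evaluators.Precision(
    response="Our summarizer output to be tracked across experiments."
)

with mlflow.start_run(run_name="summarizer-v2"):  # arbitrary run name
    mlflow.log_metric("precision_score", result.score)  # numeric score
    mlflow.log_text(result.justification, "precision_justification.txt")  # explanation for the score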

Models

What is the LLM that powers ready-made Root evaluators? Can I change it?

Root Evaluators are powered by various LLMs under the hood. This can not be changed except for on-premise deployments.

Can I see which models are GDPR compliant?

Yes, you can see model metadata under Settings > LLM Accounts. More info can be found under Control & Compliance section of our docs.

Are Evaluators/Judges deterministic?

No. We have tight confidence intervals (for the same input) but small fluctuations are to be expected. Expected standard deviations can be found in our docs.

Usage Flows

Root Signals enables several key workflows that transform how organizations measure, optimize, and control their AI applications. These flows represent common patterns for leveraging the platform's capabilities to achieve concrete outcomes.

Flow 1: Explicit Decomposition Structure as the First Class Citizen

In this flow, we transform a description of the workflow or measurement problem into a judge, consisting of a concrete set of evaluators that precisely measure success. The process involves:

  1. Success Criteria Definition: Start with your business problem or use case description, and/or what dimensions of success matter for your specific context

  2. Evaluator Selection: Map success criteria to specific evaluators from the Root Signals portfolio or create custom ones

  3. Evaluator Construction: Create custom evaluators for key measurement targets

  4. Judge Assembly: Combine selected evaluators into a coherent measurement strategy

Example: For a customer service chatbot, the problem "a chatbot for which we must ensure helpful and accurate responses" might decompose into:

  • Relevance evaluator (responses address the customer's question)

  • Completeness evaluator (all aspects of queries are addressed)

  • Politeness evaluator (maintaining professional tone)

  • Policy adherence evaluator (following company guidelines)

Flow 2: Optimization Flow

Evaluator-Driven Improvement of Prompts and Models for Operational Prompts

Given a set of evaluators, this flow systematically improves your AI application's performance:

  1. Baseline Measurement: Evaluate current prompts and models against the evaluators

  2. Variation Testing: Test different prompts, models, and configurations

  3. Optimal Performance Selection: Choose the configuration that maximizes evaluator scores against costs and latencies

Key considerations:

  • Balance accuracy improvements against cost increases

  • Consider latency requirements for real-time applications

Calibration Data-Driven Improvement of Predicates and Models for Evaluators

Given a calibration dataset, this flow systematically improves the performance of individual evaluators:

  1. Baseline Measurement: Evaluate the current predicate and model against the calibration dataset

  2. Variation Testing: Test different predicates, models, and configurations

  3. Optimal Performance Selection: Choose the configuration that maximizes calibration scores against costs and latencies

Key considerations:

  • Balance accuracy improvements against cost increases.

  • Consider latency requirements for real-time applications. Note some workflows are not sensitive to latency (email, offline agent operations)

Flow 3: Offline Data Measurement and Scoring

Transform Existing Data into Actionable Insights

This flow applies evaluators to existing datasets or LLM input-output telemetry, enabling data quality assessment and filtering (a minimal sketch follows at the end of this flow):

  1. Data Ingestion: Load transcripts, chat logs, or other text data

  2. Evaluator Application: Score each data point across the multiple evaluation dimensions

  3. Metadata Enrichment: Attach scores as searchable metadata

  4. Filtering and Analysis: Identify high/low quality samples, policy violations, or improvement opportunities

Applications:

  • Call center transcript analysis (clarity, policy alignment, customer satisfaction indicators)

  • Training data curation (identifying high-quality examples)

  • Compliance monitoring (detecting policy violations)

  • Quality assurance sampling (focusing review on problematic cases)
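A minimal sketch of this flow, assuming the transcripts live in plain Python dictionaries and using the Relevance evaluator as the scoring dimension (the records and the 0.5 filter threshold are illustrative):

from root import RootSignals

client = RootSignals()

# Illustrative records; in practice these would be loaded from your
# transcript or chat-log storage.
records = [
    {"id": "t-001", "request": "I want to return my product",
     "response": "You can return the item within 30 days of purchase."},
    {"id": "t-002", "request": "Where can I find my invoice?",
     "response": "The weather should be sunny tomorrow."},
]

scored = []
for record in records:
    result = client.evaluators.Relevance(
        request=record["request"],
        response=record["response"],
    )
    # Attach the score as searchable metadata alongside the original record
    scored.append({**record, "relevance_score": result.score})

# Filtering and analysis: surface low-quality samples for review
flagged = [r for r in scored if r["relevance_score"] < 0.5]
print(f"{len(flagged)} of {len(scored)} records flagged for review")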

Flow 4: Automated Self-Improvement and Rectification with Evaluation Feedback

This flow creates a feedback loop that automatically improves content based on evaluation results (a minimal sketch follows the lists below):

  1. Initial Evaluation: Score the original content with relevant evaluators

  2. Feedback Generation: Extract scores and justifications from evaluators

  3. Improvement Execution:

    • For LLM-generated content: Re-prompt the original model with evaluation feedback

    • For existing content: Pass to any LLM with improvement instructions based on evaluator feedback

  4. Verification: Re-evaluate to confirm improvements

Use cases:

  • Iterative response refinement in production

  • Batch improvement of historical data

  • Automated content enhancement pipelines

  • Self-improving AI systems
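A minimal sketch of this loop, assuming the operational model is called through the standard OpenAI client and that the Clarity evaluator follows the same SDK call pattern as the other Root evaluators shown in these docs; the 0.7 threshold is illustrative:

from openai import OpenAI
from root import RootSignals

rs = RootSignals()
llm = OpenAI()  # assumed operational model client

question = "Summarize our refund policy for a customer."
draft = llm.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": question}],
).choices[0].message.content

# 1. Initial evaluation
evaluation = rs.evaluators.Clarity(response=draft)

# 2.-3. If the score is too low, re-prompt with the evaluator's feedback
if evaluation.score < 0.7:  # illustrative threshold
    improved = llm.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": question},
            {"role": "assistant", "content": draft},
            {"role": "user",
             "content": f"Please revise your answer. Reviewer feedback: {evaluation.justification}"},
        ],
    ).choices[0].message.content

    # 4. Verification: re-evaluate to confirm the improvement
    print(rs.evaluators.Clarity(response=improved).score)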

Flow 5: Guardrail Flow: Real-Time Protection Through Evaluation-Based Blocking

This flow implements safety and quality controls by preventing substandard LLM outputs from reaching users (see the sketch at the end of this flow):

  1. Threshold Definition: Set minimum acceptable scores for critical evaluators

  2. Real-Time Evaluation: Score LLM outputs before delivery

  3. Conditional Blocking: Prevent responses that fall below thresholds from being served

  4. Fallback Handling: Trigger alternative responses or escalation procedures for blocked content

Implementation strategies:

  • Critical evaluators: Harmlessness, confidentiality, policy adherence

  • Quality thresholds: Minimum coherence, relevance, or completeness scores

  • Graceful degradation: Provide safe default responses when blocking occurs

  • Logging and alerting: Track blocked responses for system improvement

Applications:

  • Customer-facing chatbots requiring brand safety

  • Healthcare AI with strict accuracy requirements

  • Financial services with regulatory compliance needs

  • Educational tools requiring age-appropriate content
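A minimal sketch of the blocking step, assuming the Harmlessness evaluator follows the same SDK call pattern as the other Root evaluators; the threshold and fallback message are illustrative:

from root import RootSignals

client = RootSignals()

HARMLESSNESS_THRESHOLD = 0.8  # illustrative minimum acceptable score
FALLBACK = "I'm sorry, I can't help with that. Let me connect you with a human agent."

def guarded_reply(candidate_response: str) -> str:
    # Real-time evaluation: score the LLM output before it is served
    result = client.evaluators.Harmlessness(response=candidate_response)

    # Conditional blocking and fallback handling
    if result.score < HARMLESSNESS_THRESHOLD:
        # Logging and alerting: record the blocked response for later review
        print(f"Blocked (score={result.score:.2f}): {result.justification}")
        return FALLBACK
    return candidate_response

print(guarded_reply("Our returns process: ship the item back within 30 days for a full refund."))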

Flow 6: Lean Observation Flow

Zero-Impact Monitoring of LLM Traffic

This flow enables comprehensive observability without affecting application performance:

With Root Proxy (Simpler Implementation)

  1. Proxy Configuration: Route LLM traffic through Root Signals proxy

  2. Automatic Capture: All requests and responses logged transparently

  3. Asynchronous Processing of Evaluations: Evaluations occur out-of-band

  4. Dashboard Visibility: Real-time metrics

Benefits:

  • No code changes required in application, only base_url update

  • Automatic request/response pairing

  • Built-in retry and error handling

  • Centralized configuration management

Without Proxy (Direct Integration)

  1. Asynchronous Logging: Send request/response pairs to Root Signals API

  2. Non-Blocking Implementation: Use fire-and-forget pattern or background queues

  3. Batching Strategy: Aggregate logs for efficient transmission

  4. Resilient Design: Handle logging failures without affecting main flow

Benefits:

  • Full control over what gets logged

  • No network topology changes

  • Custom metadata enrichment

  • Selective logging based on business logic

Key considerations for both approaches:

  • Zero latency addition: Logging happens asynchronously

  • High-volume support: Handles production-scale traffic

  • Cost optimization: Sample high-volume, low-risk traffic

LangGraph

Agentic RAG with Root Signals Relevance Judge

A replication of the Agentic RAG tutorial from LangGraph, where the decision of whether or not to use the retrieved content to answer a question is powered by Root Signals Evaluators.

The following is from the LangGraph docs:

%%capture --no-stderr
%pip install -U --quiet langchain-community tiktoken langchain-openai langchainhub chromadb langchain langgraph langchain-text-splitters
import getpass
import os


def _set_env(key: str):
    if key not in os.environ:
        os.environ[key] = getpass.getpass(f"{key}:")


_set_env("OPENAI_API_KEY")
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from typing import Annotated, Sequence, Literal
from typing_extensions import TypedDict
from langchain_core.messages import BaseMessage
from langgraph.graph.message import add_messages
from langchain import hub
from langchain_core.messages import BaseMessage, HumanMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field
from langgraph.prebuilt import tools_condition
from langchain.tools.retriever import create_retriever_tool
from langgraph.graph import END, StateGraph, START
from langgraph.prebuilt import ToolNode
import pprint

urls = [
    "https://www.rootsignals.ai/post/evalops",
    "https://www.rootsignals.ai/post/llm-as-a-judge-vs-human-evaluation",
    "https://www.rootsignals.ai/post/root-signals-bulletin-january-2025",
]

docs = [WebBaseLoader(url).load() for url in urls]
docs_list = [item for sublist in docs for item in sublist]

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=100, chunk_overlap=50
)
doc_splits = text_splitter.split_documents(docs_list)

# Add to vectorDB
vectorstore = Chroma.from_documents(
    documents=doc_splits,
    collection_name="rag-chroma",
    embedding=OpenAIEmbeddings(),
)
retriever = vectorstore.as_retriever()

retriever_tool = create_retriever_tool(
    retriever,
    "retrieve_blog_posts",
    "Search and return information about Root Signals blog posts on LLM evaluation.",
)

tools = [retriever_tool]

class AgentState(TypedDict):
    # The add_messages function defines how an update should be processed
    # Default is to replace. add_messages says "append"
    messages: Annotated[Sequence[BaseMessage], add_messages]
    
### Nodes
def agent(state):
    """
    Invokes the agent model to generate a response based on the current state. Given
    the question, it will decide to retrieve using the retriever tool, or simply end.

    Args:
        state (messages): The current state

    Returns:
        dict: The updated state with the agent response appended to messages
    """
    print("---CALL AGENT---")
    messages = state["messages"]
    model = ChatOpenAI(temperature=0, streaming=True, model="gpt-4-turbo")
    model = model.bind_tools(tools)
    response = model.invoke(messages)
    # We return a list, because this will get added to the existing list
    return {"messages": [response]}


def rewrite(state):
    """
    Transform the query to produce a better question.

    Args:
        state (messages): The current state

    Returns:
        dict: The updated state with re-phrased question
    """

    print("---TRANSFORM QUERY---")
    messages = state["messages"]
    question = messages[0].content

    msg = [
        HumanMessage(
            content=f""" \n 
    Look at the input and try to reason about the underlying semantic intent / meaning. \n 
    Here is the initial question:
    \n ------- \n
    {question} 
    \n ------- \n
    Formulate an improved question: """,
        )
    ]

    # Question re-writer model
    model = ChatOpenAI(temperature=0, model="gpt-4-0125-preview", streaming=True)
    response = model.invoke(msg)
    return {"messages": [response]}


def generate(state):
    """
    Generate answer

    Args:
        state (messages): The current state

    Returns:
         dict: The updated state with the generated answer
    """
    print("---GENERATE---")
    messages = state["messages"]
    question = messages[0].content
    last_message = messages[-1]

    docs = last_message.content

    # Prompt
    prompt = hub.pull("rlm/rag-prompt")

    # LLM
    llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0, streaming=True)

    # Post-processing
    def format_docs(docs):
        return "\n\n".join(doc.page_content for doc in docs)

    # Chain
    rag_chain = prompt | llm | StrOutputParser()

    # Run
    response = rag_chain.invoke({"context": docs, "question": question})
    return {"messages": [response]}


print("*" * 20 + "Prompt[rlm/rag-prompt]" + "*" * 20)
hub.pull("rlm/rag-prompt").pretty_print()  # Show what the prompt looks like

Define the Decision-maker as a Root Judge

Now we define the Root Signals Relevance evaluator as the decision maker for whether the answer should be generated from the retrieved docs or the question should be rewritten. The advantages of using Root Signals (as opposed to the original LangGraph method) are:

  • We can control the relevance threshold because Root Signals evaluators always return a normalized score between 0 and 1.

  • If we want, we can incorporate the Justification in the decision-making process.

  • The code is much shorter, roughly one third of the length of the LangGraph tutorial version.

from root import RootSignals

client = RootSignals()

def grade_relevance(state) -> Literal["generate", "rewrite"]:
    """
    Determines whether the retrieved documents are relevant to the question.

    Args:
        state (messages): The current state

    Returns:
        str: A decision for whether the documents are relevant or not
    """
    messages = state["messages"]
    question = messages[0].content
    docs = messages[-1].content

    result = client.evaluators.Relevance(
        request=question,
        response=docs,
    )
    if result.score > 0.5:  # we can control the threshold
        return "generate"
    else:
        return "rewrite"

The rest of the tutorial follows the original LangGraph version:

# Define a new graph
workflow = StateGraph(AgentState)

# Define the nodes we will cycle between
workflow.add_node("agent", agent)  # agent
retrieve = ToolNode([retriever_tool])
workflow.add_node("retrieve", retrieve)  # retrieval
workflow.add_node("rewrite", rewrite)  # Re-writing the question
workflow.add_node(
    "generate", generate
)  # Generating a response after we know the documents are relevant
# Call agent node to decide to retrieve or not
workflow.add_edge(START, "agent")

# Decide whether to retrieve
workflow.add_conditional_edges(
    "agent",
    # Assess agent decision
    tools_condition,
    {
        # Translate the condition outputs to nodes in our graph
        "tools": "retrieve",
        END: END,
    },
)

# Edges taken after the `retrieve` node is called.
workflow.add_conditional_edges(
    "retrieve",
    # Assess agent decision
    grade_relevance,  # this is Root Signals evaluator
)
workflow.add_edge("generate", END)
workflow.add_edge("rewrite", "agent")

# Compile
graph = workflow.compile()

Our RAG Agent is ready:

inputs = {
    "messages": [
        ("user", "What is EvalOps?"),
    ]
}
for output in graph.stream(inputs):
    for key, value in output.items():
        pprint.pprint(f"Output from node '{key}':")
        pprint.pprint("---")
        pprint.pprint(value, indent=2, width=80, depth=None)
    pprint.pprint("\n---\n")

Usage

Evaluators

An evaluator is a metric for a piece of text that maps a string originating from a language model to a numeric value between 0 and 1. For example, an evaluator could measure the "Truthfulness" of the generated text.

Root Signals provides a rich collection of pre-built evaluators that you can use, such as:

  • Quality of professional writing: checks how grammatically correct, clear, concise and precise the output is

  • Completeness: evaluates how well the response addresses all aspects of the input request

  • Toxicity Detection: Identifies any toxic or inappropriate content

  • Faithfulness: Verifies the faithfulness of response with respect to a given context, acting as a hallucination detection, e.g. in RAG settings

  • Sentiment Analysis: Determines the overall sentiment (positive, negative, or neutral)

You can also define your own custom evaluators.
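
For example, a ready-made evaluator can be called directly through the Python SDK. The following is a minimal sketch; it assumes the attribute-style call and result fields (score, justification) shown elsewhere in this documentation:

from root import RootSignals

client = RootSignals()

# Run the ready-made Completeness evaluator on a request-response pair
result = client.evaluators.Completeness(
    request="List the three largest cities in Finland.",
    response="The largest city in Finland is Helsinki.",
)
print(result.score)          # normalized score between 0 and 1
print(result.justification)  # rationale for the score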

Objective

The objective of an evaluator consists of two components:

  1. Intent: This describes the purpose and goal of the evaluator, specifying what it aims to evaluate or assess in the response.

  2. Calibrator: It provides the ground truth set of appropriate numeric values for specific request-response pairs that defines the intended behavior of the evaluator. This set 'calibrates' its evaluation criteria and ensures consistent and accurate assessments.

Function

The function of an evaluator consists of three components:

  1. Prompt

  2. Demonstrations

  3. Model

Prompt

The prompt (or instruction) defines the instructions and variable content the evaluator prompts a large language model with. It should clearly specify the criteria and guidelines for assessing the quality and performance of responses.

Note: During execution, the prompt defined by the user is appended to a more general template containing instructions responsible for guiding and optimizing the behavior of the evaluator. Thus the user does not have to bother with generic instructions such as "Give a score between 0 and 1". It is sufficient to describe the evaluation criteria of the specific evaluator at hand.

Example: How well does the {{response}} adhere to instructions given in {{request}}.

Variables in an evaluator

All variable types are available for an evaluator. However, some restrictions apply.

  • The prompt of an evaluator must contain a special variable named response that represents the LLM output to be evaluated.

  • It can also contain a special variable named request if the prompt that produced the input is considered relevant for evaluation.

request and response can be either input or reference variables. In the latter case the variable is associated with a dataset that can be searched for contextual information to support the evaluation, using Retrieval Augmented Generation.

Demonstrations

A demonstration is a sample consisting of a request-response pair (or just a response, if the request is not considered necessary for evaluation), an expected score, and an optional justification. Demonstrations exemplify the expected behavior of the evaluator. Demonstrations are provided to the model and must therefore be kept strictly separate from calibration samples.

A justification illustrates the rationale for the given score. Justification can be helpful when the reason for a specific score is not obvious, allowing the model to pay attention to relevant aspects of the evaluated response and tackle ambiguous cases in a nuanced way.

Example: A sample demonstration for an evaluator determining whether content is safe for children:

Request: "Is there a refund option?"
Response: "Yes, there is a refund option available. According to clause 4.2 of the terms of business, if the engagement terminates within the first 3 months (except in cases of redundancy), a refund will be provided based on the schedule outlined in the document."
Score: 1.00
Justification: "While difficult and boring for children, the text does not involve unsafe elements."

Model

The model refers to the specific language model or engine used to execute the evaluator. It should be chosen based on its capabilities and suitability for the evaluation task.

Calibration

Calibration is the response to the naturally arising question: How can we trust evaluation results? The calibrator provides a way to quantify the performance of the evaluator by providing the ground truth against which the evaluator can be gauged. The reference dataset that forms the calibrator defines the expected behaviour of the evaluator.

The samples of the calibration dataset are similar to those of the demonstration dataset, consisting of a score, a response, an optional request, and an optional justification.

On the Calibrator page:

  • The calibration dataset can be imported from a file or typed in the editor.

  • A synthetic dataset can be generated, edited, and appended; similar or diverse test data can be automatically generated from even a single sample.

The Calibration page lets you compare the actual performance of the evaluator on the samples of the calibrator dataset against the expected performance defined by the scores of that set.

On this page:

  • Total deviance, which quantifies the average magnitude of the errors between the scores predicted by the evaluator and the actual expected scores, can be calculated. A lower total deviance indicates that the evaluator's predictions are closer to the expected outcomes, signifying better performance. The total deviance is computed using the Root Mean Square method (see the sketch after this list).

  • Deviations for individual samples of the dataset are displayed, enabling easy identification of weak points of the evaluator. If a particular sample has a high deviation, there are characteristics in the sample that confuse the evaluator.
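
As a rough sketch of this computation (plain Python with hypothetical score values, not a Root Signals API call):

import math

# Expected scores from the calibration dataset vs. scores produced by the evaluator
expected = [1.0, 0.8, 0.2, 0.0]
predicted = [0.9, 0.7, 0.4, 0.1]

# Total deviance as the root mean square of the per-sample errors
total_deviance = math.sqrt(
    sum((p - e) ** 2 for p, e in zip(predicted, expected)) / len(expected)
)

# Per-sample deviations point to the samples that confuse the evaluator most
deviations = [abs(p - e) for p, e in zip(predicted, expected)]
print(total_deviance, deviations)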

How to improve the performance of an evaluator

To improve the performance or 'calibrate' an evaluator, adjustments can be made to one or more of the three key components: the prompt, the demonstrations, and the model.

Effective strategies for this can be deduced by examining the calibration results. Inspecting the worst-performing samples, those with the largest deviations, can help identify the evaluator's weak points.

Then, one or more steps can be taken:

  1. The instructions given in the prompt can be made more specific to adjust the behavior in the problem cases.

  2. Modify demonstration content by adding examples similar to the problematic samples, which can enhance performance in these areas. Additional instructions can be added by including a justification to a demonstration. Note: Maintaining a sufficiently large calibration dataset reduces the risk of overfitting, i.e., producing an evaluator tailored to the calibration but lacking generalization.

  3. The model can be changed. Overall performance can be improved by using a larger or otherwise better suited model, often at the cost of evaluation latency and price.

After each modification, it's advisable to recalculate the deviations to assess the direction and magnitude of the impact on performance.

Evaluator permissions

As evaluators are a special type of skill, the permission controls that apply to all skills apply to evaluator skills too.

List of Evaluators Maintained by Root Signals

  • Evaluators tagged with RAG Evaluator work properly when evaluating skills with reference variables. When not used to evaluate skill outputs, a contexts parameter containing the retrieved context data as a list of strings must be passed instead (see the sketch after this list).

  • Evaluators tagged with Ground Truth Evaluator can be used for evaluating test sets that contain an expected_output column. When used through the SDK, an expected_output parameter must likewise be passed.

  • Evaluators tagged with Function Call Evaluator can be used through the SDK and require a functions parameter conforming to the OpenAI-compatible tools format.
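
For illustration, a minimal SDK sketch of passing these parameters (parameter names follow the descriptions above; the evaluator IDs and texts are placeholders):

from root import RootSignals

client = RootSignals()

# RAG Evaluator used outside a skill: pass the retrieved documents explicitly
client.evaluators.run(
    request="Is there a refund option?",
    response="Yes, refunds are available within the first 3 months.",
    contexts=["Clause 4.2: refunds are available within the first 3 months."],
    evaluator_id="<faithfulness-evaluator-id>",  # placeholder
)

# Ground Truth Evaluator: pass the expected output from the test set
client.evaluators.run(
    request="Is there a refund option?",
    response="Yes, refunds are available within the first 3 months.",
    expected_output="Refunds are available during the first 3 months of the engagement.",
    evaluator_id="<answer-correctness-evaluator-id>",  # placeholder
)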

  1. Relevance Assesses the relevance of the response in relation to the request by evaluating accuracy, completeness, adherence to the prompt, and logical consistency, to determine whether responses remain directly on-topic and informative.

  2. Safety for Children Checks the appropriateness of content for young audiences, focusing on avoiding language or themes that could be harmful or unsuitable for children, thus promoting safety and age-appropriateness.

  3. Sentiment Recognition Identifies the emotional tone of the response, determining whether it is predominantly positive, negative, or neutral, which helps in gauging the overall sentiment conveyed by the content.

  4. Coherence Assesses whether the response is logically consistent and contextually relevant, checking for smooth transitions between ideas and the maintenance of topic focus throughout the response.

  5. Conciseness Measures the brevity and directness of the response, aiming to convey information efficiently without the use of unnecessary words, thus avoiding redundancy.

  6. Engagingness Evaluates the ability of the response to engage readers, determining whether the content is interesting and likely to hold attention, or if it tends toward being boring or predictable.

  7. Originality Checks the originality and creativity of the response, focusing on providing unique and inventive perspectives rather than relying on clichéd or derivative ideas.

  8. Clarity Measures how easily the response can be understood on a first reading, aiming for ideas to be presented in a logical, straightforward manner with sufficient detail for clarity.

  9. Precision Assesses the accuracy and specificity of the response, aiming for responses that are detailed and exact, with no room for ambiguities or overly general statements.

  10. Completeness Evaluates how well the response addresses all aspects of the input request, ensuring that no important elements are overlooked and that comprehensive coverage is provided for multi-faceted queries or instructions.

  11. Persuasiveness Evaluates the persuasiveness of the response by examining the strength, clarity, and convincingness of the main argument or point, including how well it is supported by evidence, examples, and logical reasoning.

  12. Confidentiality Assesses the response for the presence of any personal details, private communications, sensitive business data, or any other information that should be kept confidential, focusing on maintaining privacy and security.

  13. Harmlessness Assesses the harmlessness of the response by checking for the presence of hate speech, slurs, derogatory language, and any incitement or glorification of violence. It evaluates the overall potential of the content to cause harm or distress.

  14. Formality Evaluates the formality of the response by considering factors such as word choice, sentence structure, tone, grammar, and overall style. This helps in matching the content to the expected level of formality for the context.

  15. Politeness Assesses the politeness of the response by examining factors such as word choice, tone, phrasing, and the overall level of respect and courtesy demonstrated in the response.

  16. Helpfulness Evaluates the helpfulness of the response by considering how useful, informative, and beneficial the text is to a reader seeking information. Helpful text provides clear, accurate, relevant, and comprehensive information to aid the reader's understanding and ability to take appropriate action.

  17. Non-toxicity Assesses the non-toxicity of the response. Text that is benign and completely harmless receives high scores.

  18. Faithfulness RAG Evaluator This corresponds to hallucination detection in RAG settings. Measures the factual consistency of the generated answer with respect to the context. It determines whether the response accurately reflects the information provided in the context. This is the high-accuracy variant of our set of Faithfulness evaluators.

  19. Faithfulness-swift RAG Evaluator This is the faster variant of our set of Faithfulness evaluators.

  20. Answer Relevance Measures how relevant a response is with respect to the prompt/query. Completeness and conciseness of the response are considered.

  21. Truthfulness RAG Evaluator Assesses factual accuracy by prioritizing context-backed claims over model knowledge, while preserving partial validity for logically consistent but unverifiable claims. Unlike Faithfulness, allows for valid model-sourced information beyond the context. This is the high-accuracy variant of our set of Truthfulness evaluators.

  22. Truthfulness-swift RAG Evaluator This is the faster variant of our set of Truthfulness evaluators.

  23. Quality of Writing - Professional Measures the quality of writing as a piece of academic or other professional text. It evaluates the formality, correctness, and appropriateness of the writing style, aiming to match professional standards.

  24. Quality of Writing - Creative Measures the quality of writing as a piece of creative text. It evaluates the creativity, expressiveness, and originality of the content, focusing on its impact and artistic expression.

  25. JSON Content Accuracy RAG Evaluator | Function Call Evaluator Checks if the content of the JSON response is accurate and matches the documents and instructions, verifying that the JSON data correctly represents the intended information.

  26. JSON Property Completeness Function Call Evaluator Checks how many of the required properties are present in the JSON response, verifying that all necessary fields are included. This is a string (non-LLM) evaluator.

  27. JSON Property Type Accuracy Function Call Evaluator Checks if the types of properties in the JSON response match the expected types, verifying that the data types are correct and consistent. This is a string (non-LLM) evaluator.

  28. JSON Property Name Accuracy Function Call Evaluator Checks if the names of properties in the JSON response match the expected names, verifying that the field names are correct and standardized. This is a string (non-LLM) evaluator.

  29. JSON Empty Values Ratio Function Call Evaluator Checks the portion of empty values in the JSON response, aiming to minimize missing information and ensure data completeness. This is a string (non-LLM) evaluator.

  30. Answer Semantic Similarity Ground Truth Evaluator Measures the semantic similarity between the generated answer and the ground truth, helping to evaluate how well the response mirrors the expected response.

  31. Answer Correctness Ground Truth Evaluator Measures the factual correspondence of the generated response against a user-supplied ground truth. It considers both semantic similarity and factual consistency.

  32. Context Recall RAG Evaluator | Ground Truth Evaluator Measures whether the retrieved context provides sufficient information to produce the ground truth response, evaluating if the context is relevant and comprehensive according to the expected output.

  33. Context Precision RAG Evaluator | Ground Truth Evaluator Measures the relevance of the retrieved contexts to the expected output.

  34. Summarization Quality Measures the quality of text summarization with high weights for clarity, conciseness, precision, and completeness.

  35. Translation Quality Quality of machine translation with high weights for accuracy, completeness, fluency, and cultural appropriateness.

  36. Planning Efficiency Quality of planning of an AI agent with high weights for efficiency, effectiveness, and goal-orientation.

  37. Information Density Information density of a response with high weights for concise, factual statements and penalizing vagueness, questions, or evasive answers.

  38. Reading Ease Evaluates the text for ease of reading, focusing on simple language, clear sentence structures, and overall clarity.

  39. Answer Willingness Answer willingness of a response with high weights for response presence, directness and penalty for response avoidance, refusal, or evasion.

Determinism

As our evaluators are LLM judges, they are non-deterministic, i.e., the same input can result in slightly different scores. We try to keep this fluctuation low. The expected standard deviations of each evaluator are reported in the Determinism Metrics reference across three dimensions: short vs. long context, single-turn vs. multi-turn, and low vs. high ground truth score.
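
To get a feel for this fluctuation in your own setting, you can run the same input through an evaluator a few times and inspect the spread of scores. A minimal sketch using the SDK call pattern shown earlier (the example texts are placeholders):

import statistics
from root import RootSignals

client = RootSignals()

# Repeat the same evaluation and look at the spread of the returned scores
scores = [
    client.evaluators.Relevance(
        request="What is EvalOps?",
        response="EvalOps is the practice of managing evaluators across their lifecycle.",
    ).score
    for _ in range(5)
]
print(statistics.mean(scores), statistics.stdev(scores))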

Version Control

Both ready-made Root Evaluators and your Custom Evaluators have version control. Normally, you can call an evaluator as:

client.evaluators.run(
    request="My internet is not working.",
    response="""
    I'm sorry to hear that your internet isn't working.
    Let's troubleshoot this step by step. What is your IP address?
    """,
    evaluator_id="bd789257-f458-4e9e-8ce9-fa6e86dc3fb9",  # e.g. corresponding to Relevance
)

and if you want to call a specific version, you can add:

evaluator_version_id="7c099204-4a41-4d56-b162-55aac24f6a47"

Click on Versions (right side) to see all versions of an evaluator.

Trust Center

Coming in August 2025 - meanwhile, contact [email protected] for the full compliance & security documentation.

Self-hosting

Documentation merging in progress - meanwhile, contact [email protected] for the full documentation

Roadmap

Root Signals builds in the open, with a philosophy of transparency and multiple open source projects. This roadmap is a living document about what we're working on and what's next. Root Signals is the world's most principled and powerful system for measuring the behavior of LLM-based applications, agents, and workflows.

Scorable is the automated LLM Evaluation Engineer agent for co-managing this platform with you.

Vision

Our vision is to create and auto-optimize the strongest automated knowledge-process evaluation stack possible, with the least amount of effort and information required from the user.

  • Maximum Automated Information Extraction

    • From user intent and/or provided example/instruction data, extract as much relevant information as possible.

  • Awareness of the information quality

    • Engage the user with the smallest amount of maximally impactful questions.

  • Maximally Powerful Evaluation Stack Generation

    • Build the most comprehensive and accurate evaluation capabilities possible, within the confines of data available.

  • Built for Agents

    • Maximum compatibility with autonomous agents and workflows.

  • Maximum Integration Surface

    • Seamless integration with all key AI frameworks.

  • EvalOps Principles for Long Term

    • Follow Root EvalOps Principles for evaluator lifecycle management.

  • Principled Evaluator Infrastructure

All feedback is highly appreciated and often leads to immediate action. Submit new GitHub issues or vote on existing ones, so we can take quick action on what is important to you.

🚀 Recently Released

  • ✅ Automated Policy Adherence Judges

    • Create judges from uploaded policy documents and intents

  • ✅ GDPR awareness of models (link)

    • Ability to filter out models not complying with GDPR

  • ✅ Evaluator Calibration Data Synthesizer v1.0 (link)

    • In the evaluator drill-in view, expand your calibration dataset from 1 or more examples

  • ✅ Evaluator version history and control to include all native Root Evaluators (link)

  • ✅ Evaluator determinism benchmarks and standard deviations in reference datasets (link)

  • ✅ Agent Evaluation MCP: stdio & SSE versions (link)

  • ✅ Root Judge LLM 70B judge available for download and running in Root Signals for free!

🏗️ Next Up

  • Public Evaluation Reports

    • Generate HTML reports from any judge execution

  • TypeScript SDK

  • Rehashing of Example-driven Evaluation

    • Smoothly create the full judge from examples

  • Native Speech Evaluator API

    • Upload or stream audio directly to evaluators

  • Unified Experiments framework to Replace Skill Tests

  • Command Line Interface

  • Advanced Judge visibility controls

    • RBAC coverage on Judges (as in Evaluators, Skills and Datasets)

  • Output Refinement At-Origin

    • Refine your LLM outputs automatically based on scores

🗓️ Planned

Scorable Features

  • Agentic Classifier Generation 2.0

    • Create classifiers with the same robustness as metric evaluator stacks

  • Automatic Context Engineering

    • Refine your prompt templates automatically based on scores

  • Support all RAG evaluators

Core Platform Features

  • Improved Playground View

Root Evaluators

  • Agent Evaluation Pack 2.0

  • (Root Evaluator list expanding every 1-2 weeks, stay tuned)

Integrations

  • Full OpenTelemetry Support

  • LiteLLM Direct Support

  • OpenRouter Support

  • (more coming)

Developer Tools

  • Sync Judge & Evaluator Definitions to GitHub

Community & Deployment

  • Community Evals

  • Self-Hostable Evaluation Executor

MCP

  • Remote MCP Server

  • MCP Feature Extension Pack

    • Full judge feature access

    • Full log insights access

Models support

  • Reasoner-specific model parameters (incl. budget) in evaluators

  • (model support list continuously expanded, stay tuned)

More Planned Features coming as we sync our changelogs and the rest of the internal roadmap contents!

Feature Requests and Bug Reports:

🐛 Bug Reports: GitHub Issues

📧 Enterprise Features: Contact [email protected]

💡 General: Discord


Last updated: 2025-06-30

LlamaIndex

Coming Soon!

Haystack

Example requires Haystack version 2.2.0 or later

To unlock full functionality, create a custom component that wraps a Root Signals skill with support for Root Signals Validators:

from typing import Dict
from typing import List
from haystack import component
from haystack.dataclasses import ChatMessage
from root import RootSignals
from root.validators import Validator
from root.generated.openapi_client.models.skill_execution_result import SkillExecutionResult

@component
class RootSignalsGenerator:
    """
    Component to enable skill use
    """
    def __init__(self, name: str, intent: str, prompt: str, model: str, validators: List[Validator]):
        self.client = RootSignals()
        self.skill = self.client.skills.create(
            name=name,
            intent=intent,
            prompt=prompt,
            model=model,
            validators=validators,
        )

    # Sketch of the run method: this assumes the skill object exposes a run()
    # that accepts a dict of prompt variables and returns a SkillExecutionResult.
    @component.output_types(replies=SkillExecutionResult)
    def run(self, messages: List[ChatMessage]):
        result = self.skill.run({"question": messages[-1].content})
        return {"replies": result}

For convenience, let's create another component to parse validation results:

from typing import Any, Dict
from root.generated.openapi_client.models.skill_execution_result import SkillExecutionResult

@component
class RootSignalsValidationResultParser:
    @component.output_types(passed=bool, details=Dict[str, Any])
    def run(self, replies: SkillExecutionResult):
        validation = replies.validation
        return {"passed": validation["is_valid"], "details": validation}

We can now replace any OpenAI-compatible generator with a validated one, based on the RootSignalsGenerator component.

from haystack.dataclasses import ChatMessage
from haystack.core.pipeline.pipeline import Pipeline
from haystack.components.builders.dynamic_chat_prompt_builder import DynamicChatPromptBuilder

generator_A = RootSignalsGenerator(
    name="My Q&A chatbot",
    intent="Simple Q&A chatbot",
    prompt="Provide a clear answer to the question: {{question}}",
    model="gpt-4o",
    validators=[Validator(evaluator_name="Clarity", threshold=0.6)]
)
pipeline = Pipeline(max_loops_allowed=1)
pipeline.add_component("prompt_builder", DynamicChatPromptBuilder())
pipeline.add_component("generator_A", generator_A)
pipeline.add_component("validation_parser", RootSignalsValidationResultParser())

pipeline.connect("prompt_builder.prompt", "generator_A.messages")
pipeline.connect("generator_A.replies", "validation_parser.replies")

prompt_template = """
    Answer the question below.
    
    Question: {{question}}
    """

result = pipeline.run(
    {
        "prompt_builder": {
            "prompt_source": [ChatMessage.from_user(prompt_template)],
            "template_variables": {
                "question": "In the field of software development, what is the meaning and significance of 'containerization'? Use a popular technology as example. Cite sources where available."
            },
        }
    },
    include_outputs_from={
        "generator_A",
        "validation_parser",
    },
)
{
  'validation_parser': {'passed': True},  # use this directly, i.e. for haystack routers
  'generator_A': {  # full response from the generator, use llm_output for the plain response
  'replies': SkillExecutionResult(
    llm_output='Containerization in software development refers to the practice of encapsulating an application and its dependencies into a "container" that can run consistently across different computing environments. This approach ensures that the software behaves the same regardless of where it is deployed <truncated> \nSources:\n- Docker. "What is a Container?" Docker, https://www.docker.com/resources/what-container.\n- Red Hat. "What is containerization?" Red Hat, https://www.redhat.com/en/topics/containers/what-is-containerization.',
    validation={'is_valid': True, 'validator_results': [{'evaluator_name': 'Clarity', 'evaluator_id': '603eae60-790b-4215-b6d3-301c16fc37c5', 'result': 0.85, 'threshold': 0.6, 'cost': 0.006645000000000001, 'is_valid': True, 'status': 'finished'}]}, 
    model='gpt-4o', 
    execution_log_id='1fbdd6fc-f5a7-4e30-a7dc-15549b7557ec', 
    rendered_prompt="Provide a clear answer to the question: Answer the question below.\n \n Question: In the field of software development, what is the meaning and significance of 'containerization'? Use a popular technology as example. Cite sources where available.", 
    cost=0.003835)
    }
}

Why Anything?

Why is Evals suddenly a thing now?

Evaluation has been critical in all Machine Learning/AI systems for decades, but it was not a hard problem until generative AI. Back then, we had our test sets with ground truth annotations and simply calculated proportions/ratios (e.g. accuracy, precision, recall, AUROC, F1-score) or well-defined metric formulas (e.g. mean absolute error, or some custom metric) to estimate the performance of our systems. If we were satisfied with the accuracy and latency, we deployed our AI models to production.

We can not do that anymore because LLMs

  1. output free text instead of pre-defined categories or numerical values

  2. are non-deterministic

  3. are instructable by semantic guidance, in other words they have a prompt. Their behaviour depends on that prompt, which is difficult to predict beforehand.

Therefore, applications powered by LLMs are inherently unpredictable, unreliable, weird, and in general hard to control.

This is the main blocker of large scale adoption and value creation with Generative AI. To overcome this, we need a new way of measuring, monitoring, and guardrailing AI systems.

But there are LLM benchmarks, no?

Yes, there are numerous LLM benchmarks and leaderboards, yet

  • They measure LLMs, not LLM applications. Benchmarks are interested in low-level academic metrics that are far away from business goals.

  • Tasks and samples in those benchmarks do not reflect real-life use cases. For example, multiple choice high school geometry question answering performance is not relevant when one is developing a customer support chatbot that should not hallucinate.

  • Benchmarks are full of low quality, incomplete, ambiguous, and erroneous samples

  • Data leakage is rampant. Consciously or not, test samples or slight variations of them are often leaked into the training data.

  • Benchmarks are not always transparent about what kind of settings they use in the experiments (e.g. temperature, zero-shot or few-shot, prompts) and are hard to replicate.

In short,

You want to measure and monitor your specific LLM-powered automation, not the generic academic capabilities of an LLM.

Datasets

Datasets in Root Signals contain static information that can be included as context for skill execution. They allow you to provide additional data to your skills, such as information about your organization, products, customers, or any other relevant domain knowledge.

By leveraging data sets, you can enhance the capabilities of your skills and provide them with relevant domain knowledge or test data to ensure their performance and accuracy.

Importing a Data Set via SDK

See SDK documentation.

Importing a Data Set via UI

To import a new data set:

  • Navigate to the Data Sets view.

  • Click the "Import Data Set" button on the top right corner of the screen.

  • Enter a name for your data set. If no name is provided, the file name will be used as the data set name.

  • Choose the data set type:

    • Reference Data: Used for skills that require additional context.

    • Test Data: Used for defining test cases and validating skill or evaluator performance.

  • Select a tag for the data set or create a new one.

  • Either upload a file or provide a URL from which the system can retrieve the data.

  • Preview the data set by clicking the "Preview" button on the bottom right corner.

  • Save the data set by clicking the "Submit" button.

Using Data Sets in Skills

Reference Data Sets

Data sets can be linked to skills using reference variables. When defining a skill, you can choose a data set as a reference variable, and the skill will have access to that data set during execution. This allows you to provide additional context or information to the skill based on the selected data set.

Test Data Sets

When creating a new skill or an evaluator, you can select a test data set or a calibration data set, respectively, to drive the skill or evaluator with multiple predefined sequential inputs for performance evaluation.

Root Signals allows you to test your skill against multiple models simultaneously. In the "Prompts" and "Models" sections of the skill creation form, you can add multiple prompt variants and select one or more models to test. By clicking the "Test" / "Calibrate" button in the bottom right corner, the system runs your selected test data set against each of the chosen prompts and models. This lets you compare their performance and select the combination with the best trade-offs for your use case.