Objectives consist of a human-readable Intent and ground truth examples. An objective serves two purposes:
Communication: Expressing the intended business purpose of the evaluator
Coordination: Serving as a battery of measures
Advanced use cases and common recipes
Root Signals is a measurement, observability, and control platform for GenAI applications, automations, and agentic workflows powered by Large Language Models (LLMs). Such applications include chatbots, Retrieval Augmented Generation (RAG) systems, agents, data extractors, summarizers, translators, AI assistants, and various automations powered by LLMs.
Any developer can use Root Signals to:
Add appropriate metrics such as Truthfulness, Answer Relevance, or Coherence of responses to any LLM pipeline and optimize their design choices (which LLM to use, prompt, RAG hyper-parameters, etc.) using these measurements:
Log, record, and compare their changes and the measurements corresponding to those changes
Integrate metrics into CI/CD pipelines (e.g. GitHub actions) to prevent regressions
Turn those metrics into guardrails that prevent wrong, inappropriate, undesirable, or in general sub-optimal behaviour of their LLM apps simply by adding trigger thresholds. Monitor the performance in real-time, in production.
Create custom metrics for attributes ranging from 'mention of politics' to 'adherence to our communications policy document v3.1'.
The dashboard provides a comprehensive overview of the performance of your specific LLM applications:
Root Signals provides 30+ built-in, ready-to-use evaluators called Root Evaluators.
Utilizing any LLM as a judge, you can create, benchmark, and tune Custom Evaluators.
We provide complete observability to your LLM applications through our Monitoring view.
Root Signals is available via
🖥️ Web UI
📑 REST API
🔌 Model Context Protocol (MCP) Server (for Agents)
Root Signals can be used by individuals and organizations. Role-based Access Controls (RBAC), SLA, and security definitions are available for organizations. Enterprise customers also enjoy SSO signups via Okta and SAML.
Create a free Root Signals account and get started in 30 seconds.
Models are the actual source of the intelligence. A model generally refers to the type of model (such as GPT), the provider of the model (such as Azure), and the specific variant (such as GPT-4o). The models available on the Root Signals platform consist of:
Proprietary and hosted open-source models accessible via API. These models can be accessed via your API key or the Root Signals platform key (with corresponding billing responsibilities).
Open-source models provided by Root Signals.
Models added by your organization. See the models page for model details.
Some model providers are GDPR-compliant, ensuring data processing meets the General Data Protection Regulation requirements. However, please note that GDPR compliance by the provider does not necessarily mean that data is processed within the EU.
The organization admin can control the API keys and restrict access to a specific subset of models.
Datasets in Root Signals contain static information that can be included as context for skill execution. They can contain information about your organization, products, customers, etc. Datasets are linked to skills using reference variables.
Access to datasets is controlled through permissions. By default, when a user uploads a new dataset, it is set to 'unlisted' status. Unlisted datasets are only visible to the user who created them and to administrators in the organization. This allows users to work on datasets privately until they are ready to be shared with others.
To make a dataset available to other users in the organization, the dataset owner or an administrator needs to change the status to 'listed'. Listed datasets are visible to all users in the organization and can be used in skills by anyone.
Note that dataset permissions control whether a dataset can be used in skill creation or skill editing as a reference variable or as a test dataset. Unless more specific permissions information is made available via enterprise integrations, dataset permissions do not control who can use the dataset in skill execution. That is, once a dataset is fixed to a skill as a reference variable, anyone who has privileges to execute the skill will also have implicit access to the dataset through the skill execution.
It is important for dataset owners and administrators to carefully consider the sensitivity and relevance of datasets before making them widely available. Datasets may contain confidential or proprietary information that should only be accessible to authorized users.
Contact Root Signals for more fine-grained controls in enterprise, regulated or governmental contexts.
The dataset permission system in Root Signals allows for granular control over who can access and use specific datasets. The unlisted/listed status toggle and the special privileges granted to administrators provide flexibility in managing data assets across the organization. Proper management of dataset permissions is crucial for ensuring data security and relevance in skill development and execution.
The requests to any models wrapped within skill objects, and their responses, are traceable within the log objects of the Root Signals platform.
The retention of logs is determined by your platform license. You may export logs at any point for your local storage. Access to execution logs is restricted based on your user role and skill-specific access permissions.
Objectives, evaluators, skills and test datasets are strictly versioned. The version history allows keeping track of all local changes that could affect the execution.
To understand reproducibility of pipelines of generative models, these general principles hold:
For any models, we can control for the exact inputs to the model, record the responses received, and the evaluator results of each run.
For open source models, we can pinpoint the exact version of the model (weights) being used, if this is guaranteed by the model provider, or if the provider is Root Signals.
For proprietary models whose weights are not available, we can pinpoint the version based on the version information given by the providers (such as gpt-4-turbo-2024-04-09) but we cannot guarantee those models are, in reality, fully immutable
Any LLM request with a 'temperature' parameter above 0 should be assumed non-deterministic. Temperature = 0 and/or a fixed value of a 'seed' parameter usually make the result deterministic, but your mileage may vary.
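As a hedged illustration, the sketch below (using the OpenAI Python client; the model name and prompt are placeholders) pins both temperature and seed. Even then, providers treat the seed as a best-effort hint, so identical outputs are not guaranteed.
# Minimal sketch: requesting (near-)deterministic output from a chat model.
# Assumes OPENAI_API_KEY is set; the model name and prompt are illustrative.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,  # remove sampling randomness
    seed=42,        # best-effort reproducibility hint, not a hard guarantee
    messages=[{"role": "user", "content": "Summarize the Q1 revenue drivers."}],
)
print(response.choices[0].message.content)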
Access to all Root Signals entities is controlled via user roles as well as entity-specific settings by the entity creator. The fundamental roles are the User and the Administrator.
Administrators have additional privileges for managing datasets, objectives, skills, and evaluators:
Administrators can see all the entities in the organization, including unlisted ones. This allows them to have an overview of all the data and functional assets in the organization.
Administrators can change the status of any entity such as a dataset from unlisted to listed and vice versa. This enables them to control which entities are shared with the wider organization.
Administrators can delete any entity, regardless of who created it. This is useful for managing obsolete or irrelevant entities.
Administrators also control the accessibility of models across the organization, as well as users and billing.
Root Signals provides evaluators for RAG use cases, where you can give the context as part of the evaluated content.
One such evaluator is the Truthfulness evaluator, which measures the factual consistency of the generated answer against the given context and general knowledge.
Here is an example of running the Truthfulness evaluator using the Python SDK. Pass the context used to get the LLM response in the contexts parameter.
from root import RootSignals
# Connect to the Root Signals API
client = RootSignals()
result = client.evaluators.Truthfulness(
    request="What was the revenue in Q1/2023?",
    response="The revenue in the last quarter was 5.2 M USD",
    contexts=[
        "Financial statement of 2023",
        "2023 revenue and expenses...",
    ],
)
print(result.score)
# 0.5
Both our Skills and Evaluators may be used as custom-generator LLMs in 3rd party frameworks, and we are committed to supporting an OpenAI ChatResponse-compatible API.
Note, however, that additional functionality, such as validation results and calibration, is not available as part of OpenAI responses and requires the user to implement additional code if anything besides failing on unsuccessful validation is required.
Advanced use cases can rely on referencing the completion.id returned by our API as a unique identifier for downstream tasks. Please refer to the Cookbook section for details.
We adhere to Semantic Versioning (SemVer) principles to manage the versions of our software products effectively. This ensures clarity and predictability in how updates and changes are handled.
Communication of Breaking Changes
Notification: All breaking changes are communicated to stakeholders via email. These notifications provide details about the nature of the change, the reasons behind it, and guidance on how to adapt to these changes.
Versioning: When a breaking change is introduced, the major version number of the software is incremented. For example, an upgrade from version 1.4.5 to 2.0.0 indicates the introduction of changes that may disrupt existing workflows or dependencies.
Documentation: Each major release accompanied by breaking changes includes updated documentation that highlights these changes and provides comprehensive migration instructions to assist in transitioning smoothly.
In Root Signals, evaluation is treated as a procedure to compute a metric grounded on a human-defined criteria, emphasizing the separation of utility grounding (Objective) and implementation (Evaluator function).
This lets the criteria and implementations for the evaluations evolve in two separate controlled and trackable tracks, each with different version control logic.
Metric evaluators are different from other entities in the world, and simply treating them as "grounded in data", on one hand, or as "tests", on the other, misses some of their core properties.
In Root Signals, an Objective consists of:
Intent that is human-defined and human-understandable, corresponding to the precise attribute being measured.
Calibration data set that defines, via examples, the structure and scale of those criteria.
An Evaluator function consists of:
Predicate that uniquely specifies the task to the LLMs that power the evaluator
LLM
In-context examples (demonstrations)
[Optionally] Associated data files
An Evaluator function is typically associated with an Objective that connects it to business / contextual value, but the two have no causal connection.
Root Signals platform itself handles:
Semantic quantization: Guaranteeing the predicates are consistently mapped to metrics (for supported LLMs). This lets us abstract the predicates out of the boilerplate prompts needed to yield robust metrics
Version control of evaluator implementations
Maintenance of relationships between Objectives and Evaluators
Monitoring
For example, if an Objective is changed (e.g., its calibration dataset is altered), it is not a priori clear whether the related criteria have changed; such a change affects all evaluator variants using the Objective and may render measurements backwards-incompatible. Hence, the best practice enforced by the Root Signals platform is to create an entirely new Objective, so that it is clear the criteria have changed. This can be bypassed, however, when the Objective is still in its formation stage and/or you accept that the criteria will change over time.
Over time, improved evaluator functions will be created (including but not limited to model updates) to improve upon the Objective targets. On the other hand, Objectives tend to branch and become more precise over time, passing the burden of resolving the question of "is this still the same Objective" to the users, while providing the software support to make those calls either way in an auditable and controllable manner.
Scorable is the automated LLM Evaluation Engineer agent for co-managing Root Signals platform with you.
To get started, sign up and log in to the Root Signals app. Select an evaluator under the Evaluators tab and Execute. You will get a score between 0 and 1 and the justification for the score.
Create your Root Signals API key under Settings > Developer.
pip install root-signals
Root Signals provides over 30 evaluators or judges, which you can use to score any text based on a wealth of metrics. You can attach evaluators to an existing application with just a few lines of code.
from root import RootSignals
# Just a quick test?
# You can get a temporary API key from https://app.rootsignals.ai/demo-user
client = RootSignals(api_key="my-developer-key")
client.evaluators.Politeness(
response="You can find the instructions from our Careers page."
)
# {score=0.7, justification='The response is st...', execution_log_id=...}
npm install @root-signals/typescript-sdk
# or
yarn add @root-signals/typescript-sdk
# or
pnpm add @root-signals/typescript-sdk
and execute:
import { RootSignals } from '@root-signals/typescript-sdk';
// Connect to Root Signals API
const client = new RootSignals({
apiKey: process.env.ROOTSIGNALS_API_KEY!
});
// Run any of our ready-made evaluators
const result = await client.evaluators.executeByName('Helpfulness', {
response: "You can find the instructions from our Careers page."
});
You can execute evaluators in your favourite framework and tech stack via our REST API:
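For instance, here is a minimal sketch using Python's requests library against the evaluator execution endpoint shown elsewhere in these docs; the evaluator ID, API key, and payload are placeholders.
# Minimal REST sketch; evaluator ID and API key are placeholders.
import requests

url = "https://api.app.rootsignals.ai/v1/skills/evaluator/execute/<EVALUATOR_ID>/"
headers = {
    "Authorization": "Api-Key <YOUR API KEY>",
    "Content-Type": "application/json",
}
payload = {"response": "You can find the instructions from our Careers page."}

result = requests.post(url, headers=headers, json=payload, timeout=30).json()
print(result["score"])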
Our Model Context Protocol (MCP) equips your AI agents with evaluation capabilities.
Root Signals design philosophy starts from the principle of extreme semantic rigor. Briefly, this means making sure that, for example
The definitions and references of entities are tracked with maximal (and increasing) precision
Entities are assumed long-term and upgradeable
Entities are built for re-use
Changes will be auditable
Objective defines what you intend to achieve. It grounds an AI automation to the business target, such as providing a feature ('transform data source X into usable format Y') or a value ('suitability for use in Z').
Evaluator is a function that assigns a numeric value to a piece of content such as text, along a semantically defined dimension (truthfulness, relevance of an answer, coherence, etc.).
Judge is a stack of evaluators with a high-level Intent.
Model is the AI model such as an LLM that provides the semantic processing of the inputs. Notably, the list contains both API-based models such as OpenAI and Anthropic models, and open source models such as Llama and Mistral models. Finally, you can add your own locally running models to the list with ease. The organization Admin controls the availability of models enabled in your organization.
To ensure the reliability of the Direct Language evaluator, you can create and use test data, referred to as a calibration dataset. A calibration set is a collection of LLM outputs, prompts, and expected scores that serve as benchmarks for evaluator performance.
Start by attaching an empty calibration set to the evaluator:
Navigate to the Direct Language evaluator page and click Edit.
Select the Calibration section and click Add Dataset.
Name the dataset (e.g., “Direct Language Calibration Set”).
Optionally, add sample rows, such as:
"0,2","I am pretty sure that is what we need to do"
Click Save and close the dataset editor.
Optionally, click the Calibrate button to run the calibration set.
Save the evaluator
You can enhance your calibration set using real-world data from evaluator runs stored in the execution log.
Go to the Execution Logs page.
Locate a relevant evaluator run and click on it.
Click Add to Calibration Dataset to include its output and score in the calibration set.
By regularly updating and running the calibration set, you safeguard the evaluator against unexpected behavior, ensuring its continued accuracy and reliability.
The list of well-calibrated Root evaluators includes:
Relevance
Safety for Children
Sentiment Recognition
Coherence
Conciseness
Engagingness
Originality
Clarity
Precision
Persuasiveness
Confidentiality
Harmlessness
Formality
Politeness
Helpfulness
Non-toxicity
Faithfulness RAG Evaluator
Faithfulness-swift RAG Evaluator
Answer Relevance
Truthfulness RAG Evaluator
Truthfulness-swift RAG Evaluator
Quality of Writing - Professional
Quality of Writing - Creative
JSON Content Accuracy RAG Evaluator | Function Call Evaluator
JSON Property Completeness Function Call Evaluator
JSON Property Type Accuracy Function Call Evaluator
JSON Property Name Accuracy Function Call Evaluator
JSON Empty Values Ratio Function Call Evaluator
Answer Semantic Similarity Ground Truth Evaluator
Answer Correctness Ground Truth Evaluator
Context Recall RAG Evaluator | Ground Truth Evaluator
Context Precision RAG Evaluator | Ground Truth Evaluator
Summarization Quality
Translation Quality
Information Density
Reading Ease
Planning Efficiency
Answer Willingness
Details of each evaluator can be found below.
Judges are stacks of evaluators with their own high-level intent.
You can see the overview of your Judges in the app:
You can inspect a Judge in detail as well:
Root Signals provides evaluators that fit most needs, but you can add custom evaluators for specific needs. In this guide, we will add a custom evaluator and tune its performance using demonstrations.
Consider a use case where you need to evaluate a text based on its number of weasel words or ambiguous phrases. Root Signals provides the optimized Precision evaluator for this, but let's build something similar to go through the evaluator-building process.
Navigate to the Evaluator Page:
Go to the evaluator page and click on "New Evaluator."
Name Your Evaluator:
Type the name for the evaluator, for example, "Direct language."
Define the Intent:
Give the evaluator an intent, such as "Ensures the text does not contain weasel words."
Create the Prompt:
"Is the following text clear and has no weasel words"
Add a placeholder (variable) for the text to evaluate:
Click on the "Add Variable" button to add a placeholder for the text to evaluate.
E.g., "Is the following text clear and has no weasel words: {{response}}"
Select the Model:
Choose the model, such as gpt-4-turbo, for this evaluation.
Save and Test the Evaluator:
Click Create evaluator and test it.
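The same evaluator can also be created programmatically. The sketch below assumes the Python SDK exposes an evaluators.create method with these parameter names; check the SDK reference for the exact signature.
# Hedged sketch: creating the evaluator via the Python SDK.
# The evaluators.create parameter names are assumptions; verify against the SDK reference.
from root import RootSignals

client = RootSignals()

evaluator = client.evaluators.create(
    name="Direct language",
    intent="Ensures the text does not contain weasel words.",
    predicate="Is the following text clear and has no weasel words: {{response}}",
    model="gpt-4-turbo",
)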
You can add demonstrations to the evaluator to tune its scores to match more closely to the desired behavior.
Let's penalize using the word "probably"
Go to the Direct language evaluator and click Edit
Click Add under Demonstrations section
Add a demonstration
Type to the Response field: "This solution will probably work for most users."
Score: 0.1
Save the evaluator and try it out
Note that adding more demonstrations, such as
"The project will probably be completed on time."
"We probably won't need to make any major changes."
"He probably knows the answer to your question."
"There will probably be a meeting tomorrow."
"It will probably rain later today."
will further adjust the evaluator's behavior. Refer to the full evaluator documentation for more information.
Building production-ready and reliable AI applications requires safeguards provided by an evaluation layer. LLM responses can vary drastically based on even the slightest input changes.
Root Signals provides a robust set of fundamental evaluators suitable for any LLM-based application.
You need a few examples of LLM outputs (text). Those can be from any source, such as a summarization output on a given topic.
The Evaluators page shows all evaluators at your disposal. Root Signals provides the base evaluators, but you can also build custom evaluators for specific needs.
Let's start with the Precision evaluator. Based on the text you want to evaluate, feel free to try other evaluators as well.
Click on the Precision evaluator and then click on the Execute skill button.
Paste the text you want to evaluate into the output field and click Execute. You will get a numeric score based on the metric the evaluator is evaluating and the text to evaluate.
An individual score is not very interesting. The power of evaluation lies in integrating evaluators into an LLM application.
Integrating the evaluators as part of your LLM application is a more systematic approach to evaluating LLM outputs. That way, you can compare the scores over time and take action based on the evaluation results.
The Precision evaluator details page contains information on how to add it to your application. First, you must fetch a Root Signals API key and then execute the example cURL command.
Go to the Precision evaluator details page
Click on the Add to your application link
Copy the cURL command
You can omit the request field from the data payload and add the text to evaluate in the response field.
Example (cURL)
A Root Signals subscription provides a selection of models you can use in any of your skills. You are not limited by that selection, though. Integrating with cloud providers' models or connecting to locally hosted models is possible via the SDK or the REST API.
To use an external model, add the model endpoint via the SDK or through the REST API:
After adding the model, you can use it like any other model in your skills and evaluators.
curl --request POST \
--url https://api.app.rootsignals.ai/v1/models/ \
--header 'Authorization: Api-Key $ROOT_SIGNALS_API_KEY' \
--header 'accept: application/json' \
--header 'content-type: application/json' \
--data '
{
"name": "huggingface/meta-llama/Meta-Llama-3-8B",
"url": "https://my-endpoint.huggingface.cloud",
"default_key": "$HF_KEY"
}
'
skill = client.skills.create(
name="My model test", prompt="Hello, my model!", model="huggingface/meta-llama/Meta-Llama-3-8B"
)
# pip install openai
from openai import OpenAI
client = OpenAI(
api_key="$MY_API_KEY",
base_url="https://api.app.rootsignals.ai/v1/judges/$MY_JUDGE_ID/openai/"
)
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "user", "content": "I want to return my product"}
]
)
print(f"Assistant's response: {response.choices[0].message.content}")
print(f"Judge evaluation results: {response.model_extra.get('evaluator_results')}")
curl 'https://api.app.rootsignals.ai/v1/judges/$MY_JUDGE_ID/execute/' \
-H 'authorization: Api-Key $MY_API_KEY' \
-H 'content-type: application/json' \
--data-raw '{"response":"LLM said: You can return the item within 30 days of purchase, and we will refund the full amount...","request":"I want to return my product"}'
# pip install root-signals
from root import RootSignals
client = RootSignals(api_key="$MY_API_KEY")
result = client.judges.run(
judge_id="$MY_JUDGE_ID",
response="LLM said: You can return the item within 30 days of purchase, and we will refund the full amount...",
request="I want to return my product"
)
print(f"Run results: {result.evaluator_results}")
# Score (a float between 0 and 1): {result.evaluator_results[0].score}
# Justification for the score: {result.evaluator_results[0].justification}
curl 'https://api.app.rootsignals.ai/v1/skills/evaluator/execute/767bdd49-5f8c-48ca-8324-dfd6be7f8a79/' \
-H 'authorization: Api-Key <YOUR API KEY>' \
-H 'content-type: application/json' \
--data-raw '{"response":"While large language models (LLMs) have many powerful applications, there are scenarios where they are not as effective or suitable. Here are some use cases where LLMs may not be useful:\n\nReal-Time Critical Systems:\nLLMs are not ideal for applications requiring real-time, critical decision-making, such as air traffic control, medical emergency systems, or autonomous vehicle navigation, where delays or errors can have severe consequences.\n\nHighly Specialized Expert Tasks:\nTasks that require deep domain-specific expertise, such as advanced scientific research, complex legal analysis, or detailed medical diagnosis, may be beyond the capabilities of LLMs due to the need for precise, highly specialized knowledge and judgment."}'
# pip install root-signals
from root import RootSignals
client = RootSignals(api_key="<YOUR API KEY>")
client.evaluators.Precision(
response="While large language models (LLMs) have many powerful applications, there are scenarios where they are not as effective or suitable. Here are some use cases where LLMs may not be useful:\n\nReal-Time Critical Systems:\nLLMs are not ideal for applications requiring real-time, critical decision-making, such as air traffic control, medical emergency systems, or autonomous vehicle navigation, where delays or errors can have severe consequences.\n\nHighly Specialized Expert Tasks:\nTasks that require deep domain-specific expertise, such as advanced scientific research, complex legal analysis, or detailed medical diagnosis, may be beyond the capabilities of LLMs due to the need for precise, highly specialized knowledge and judgment."
)
Example requires langfuse v3.0.0
import os
from langfuse import observe, get_client
from openai import OpenAI
from root import RootSignals
# Initialize Langfuse client using environment variables
# LANGFUSE_SECRET_KEY, LANGFUSE_PUBLIC_KEY, LANGFUSE_HOST
langfuse = get_client()
# Initialize RootSignals client
rs = RootSignals()
# OpenAI client used for the generation step
client = OpenAI()
# Illustrative prompt template for the generation step (not part of the original snippet)
prompt_template = "Explain the following concept in simple terms: {question}"
@observe(name="explain_concept_generation")  # Name for traces in Langfuse UI
def explain_concept(topic: str) -> tuple[str | None, str | None]: # Returns content and trace_id
# Get the trace_id for the current operation, created by @observe
current_trace_id = langfuse.get_current_trace_id()
prompt = prompt_template.format(question=topic)
response_obj = client.chat.completions.create(
messages=[{"role": "user", "content": prompt}],
model="gpt-4",
)
content = response_obj.choices[0].message.content
return content, current_trace_id
def evaluate_concept(request: str, response: str, trace_id: str) -> None:
# Invoke a specific Root Signals judge
result = rs.judges.run(
judge_id="4d369224-dcfa-45e9-939d-075fa1dad99e",
request=request, # The input/prompt provided to the LLM
response=response, # The LLM's output to be evaluated
)
# Iterate through evaluation results and log them as Langfuse scores
for eval_result in result.evaluator_results:
langfuse.create_score(
trace_id=trace_id, # Links score to the specific Langfuse trace
name=eval_result.evaluator_name, # Name of the Root Signals evaluator (e.g., "Truthfulness")
value=eval_result.score, # Numerical score from the evaluator
comment=eval_result.justification, # Explanation for the score
)
Table: Mapping Root Signals Output to Langfuse Score Parameters

| Root Signals | Langfuse | Description in Langfuse Context |
| --- | --- | --- |
| evaluator_name | name | The name of the evaluation criterion (e.g., "Hallucination," "Conciseness"). Used for identifying and filtering scores. |
| score | value | The numerical score assigned by the Root Signals evaluator. |
| justification | comment | The textual explanation from Root Signals for the score, providing qualitative insight into the evaluation. |
Done. Now you can explore detailed traces and metrics in the Langfuse dashboard.
The Root Signals platform is built upon foundational principles that ensure semantic rigor, measurement accuracy, and operational flexibility. These principles guide the design and implementation of all platform features, from evaluator creation to production deployment.
At the core of Root Signals lies a fundamental distinction between what should be measured and how it is measured:
An Objective defines the precise semantic criteria and measurement scale for evaluation.
An Evaluator represents an implementation that can meet these criteria.
This separation enables:
Multiple evaluator implementations for the same objective
Evolution of measurement techniques without changing business requirements
Clear communication between stakeholders about evaluation goals
Standardized benchmarking across different implementations
In practice, an objective consists of an Intent (describing the purpose and goal) and a Calibrator (the score-annotated dataset providing ground truth examples). The evaluator's function—comprised of prompt, demonstrations, and model—represents just one possible implementation of that objective.
Every measurement instrument requires calibration against known standards. In Root Signals, evaluators undergo rigorous calibration to ensure their scores align with human judgment baselines. This process involves:
Calibration datasets: Ground truth examples with expected scores, including optional justifications that illustrate the rationale for specific scores
Deviation analysis: Quantitative assessment using the Root Mean Square method to calculate total deviance between predicted and actual values (formalized below)
Continuous refinement: Iterative improvement based on calibration results, focusing on samples with highest deviation
Version control: Tracking evaluator performance across iterations
Production feedback loops: Adding real execution samples to calibration sets for ongoing improvement
The calibration principle acknowledges that LLM-based evaluators are probabilistic instruments requiring empirical validation. Calibration samples must be strictly separated from demonstration samples to ensure unbiased measurement.
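As a sketch, with predicted scores $s_i$, expected calibration scores $\hat{s}_i$, and $N$ calibration samples, the total deviance described above takes the usual root-mean-square form:
\[ \text{total deviance} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(s_i - \hat{s}_i\right)^2} \]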
All evaluations in Root Signals are fundamentally metric evaluations, producing normalized scores between 0 and 1. This universal approach provides:
Generalizability: Any evaluation concept can be expressed as a continuous metric
Optimization capability: Numeric scores enable gradient-based optimization
Fuzzy semantics handling: Real-world concepts exist on spectrums rather than binary states
Composability: Metrics can be combined, weighted, and aggregated
This principle recognizes that language and meaning are inherently fuzzy, requiring nuanced measurement approaches. Every evaluator maps text to a numeric value, enabling consistent measurement across diverse dimensions like coherence (logical consistency), conciseness (brevity without information loss), or harmlessness (absence of harmful content).
The platform maintains strict independence from specific model implementations, both for operational models (those being evaluated) and judge models (those performing evaluation). This enables:
Model comparison: Evaluate multiple models using identical criteria
Performance optimization: Select models based on accuracy, cost, and latency trade-offs
Future-proofing: Integrate new models as they become available
Vendor independence: Avoid lock-in to specific model providers
Changes in either operational or judge models can be measured precisely, enabling data-driven model selection. The platform supports API-based models (OpenAI, Anthropic), open-source models (Llama, Mistral), and custom locally-running models. Organization administrators control model availability, ensuring governance while maintaining flexibility.
Evaluation definitions must transcend platform boundaries through standardized, interchangeable formats. This principle ensures:
Clear entity references: Distinguish between evaluator references and definitions
Objective portability: Move evaluation criteria between systems
Implementation flexibility: Express objectives independent of specific implementations
Semantic preservation: Maintain meaning across different contexts
The distinction between referencing an entity and describing it enables robust system integration.
Complex evaluation predicates can be expressed either as a single (inherently composite) evaluator or decomposed into a vector of multiple independent evaluators that effectively indicate a dimension of measurement. This principle provides:
Granular calibration: Each dimension can be independently calibrated
Modular development: Evaluators can be developed and tested separately
Precise diagnostics: Identify which specific dimensions need improvement
Flexible composition: Combine dimensions based on use case requirements
For example, "helpfulness" might decompose into truthfulness, relevance, completeness, and clarity—each with its own evaluator and calibration set. This decomposition extends to specialized domains: RAG evaluators (faithfulness, context recall), structured output evaluators (JSON accuracy, property completeness), and task-specific evaluators (summarization quality, translation accuracy), etc. Judges represent practical implementations of this principle, stacking multiple evaluators to achieve comprehensive assessment.
Similar to evaluation objectives, an operational task should have an objective that defines its success criteria independent of implementation. An operational objective consists of:
Intent: The business purpose of the operation.
Success criteria: The set of evaluators that together define acceptable outcomes and what good looks like.
This set of evaluators can be captured in a judge, while the intent is captured in the judge's intent description.
Implementation independence: Multiple ways to achieve the objective
This principle extends the objective/implementation separation to operational workflows, enabling outcome-based task definition rather than prescriptive implementation.
The Root Evaluators are designed as a set of primitive, orthogonal measurement dimensions that minimize overlap while maximizing coverage. This principle ensures:
Minimal redundancy: Each evaluator measures a distinct semantic dimension
Maximal composability: Evaluators combine cleanly without interference
Complete coverage: The primitive set spans the space of common evaluation needs
Predictable composition: Combining evaluators yields intuitive results
This orthogonality enables judges to be constructed as precise combinations of primitive evaluators. For instance, "professional communication quality" might combine:
Clarity (information structure)
Formality (tone appropriateness)
Precision (technical accuracy)
Grammar correctness (linguistic quality)
Each dimension contributes independently, allowing fine-grained control over the composite evaluation. The orthogonal design prevents double-counting of features and ensures that improving one dimension doesn't inadvertently degrade another. In cases where one can arguably interpret the evaluator in several different ways, we split these into separate objectives and corresponding Root Evaluators, such as in the case of relevance which may or may not be interpreted to include truthfulness (for instance, in a factual context, an untrue statement is arguably irrelevant, whereas in a story or hypothetical context, this may not be the case).
These principles manifest throughout the Root Signals platform:
Evaluator creation starts with objective definition before implementation
Calibration workflows ensure measurement reliability
Judge composition allows stacking evaluators for complex assessments
Version control tracks both objectives and implementations
API design separates concerns between what and how
By adhering to these principles, Root Signals provides a semantically rigorous foundation for AI evaluation that scales from simple metrics to complex operational workflows.
Integrate Root Signals evaluations with Google Cloud's Vertex AI Agent Builder to monitor and improve your conversational AI agents in real-time.
[Vertex AI Agent Builder]
|
|—→ [Webhook call (to Cloud Function / Cloud Run)]
|
|—→ [Root Signals API]
|
|—→ [Evaluate response]
|
[Log result / augment reply]
|
←——————— Reply to Agent Builder user
Go to "Manage Fulfillment" in the Agent Builder UI.
Create a webhook (can be a Cloud Function, Cloud Run, or any HTTP endpoint).
This webhook will receive request and response pairs from user interactions.
This endpoint will:
Receive user input and the LLM response.
Construct an evaluator call to Root Signals API.
Send the result back as part of the webhook response (optional).
Option 1: Using Built-in Evaluators
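// Note: these handlers assume an Express app with JSON body parsing is already set up, e.g.:
// const express = require('express');
// const app = express();
// app.use(express.json());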
app.post('/evaluate', async (req, res) => {
const userInput = req.body.sessionInfo.parameters.input;
const modelResponse = req.body.fulfillmentResponse.messages[0].text.text[0];
// Use a built-in evaluator (e.g., Relevance)
const evaluatorPayload = {
request: userInput,
response: modelResponse,
};
const evaluatorResult = await fetch('https://api.app.rootsignals.ai/v1/skills/evaluator/execute/YOUR_EVALUATOR_ID/', {
method: 'POST',
headers: {
'Authorization': 'Api-Key YOUR_API_KEY',
'Content-Type': 'application/json',
},
body: JSON.stringify(evaluatorPayload),
});
const result = await evaluatorResult.json();
console.log('Evaluator Score:', result.score);
// Return modified response (if needed)
res.json({
fulfillment_response: {
messages: [
{
text: {
text: [
`${modelResponse} (Quality score: ${result.score.toFixed(2)})`
]
}
}
]
}
});
});
Option 2: Using Custom Judges
app.post('/evaluate', async (req, res) => {
const userInput = req.body.sessionInfo.parameters.input;
const modelResponse = req.body.fulfillmentResponse.messages[0].text.text[0];
// Use a custom judge
const judgePayload = {
request: userInput,
response: modelResponse,
};
const judgeResult = await fetch('https://api.app.rootsignals.ai/v1/judges/YOUR_JUDGE_ID/execute/', {
method: 'POST',
headers: {
'Authorization': 'Api-Key YOUR_API_KEY',
'Content-Type': 'application/json',
},
body: JSON.stringify(judgePayload),
});
const result = await judgeResult.json();
console.log('Judge Score:', result.evaluator_results);
// Return modified response (if needed)
res.json({
fulfillment_response: {
messages: [
{
text: {
text: [
`${modelResponse} (Judge results: ${JSON.stringify(result.evaluator_results)})`
]
}
}
]
}
});
});
Built-in Evaluators:
Use evaluators like Relevance, Precision, Completeness, Clarity, etc.
Get available evaluators by logging in to https://app.rootsignals.ai/
Examples: Relevance, Truthfulness, Safety, Professional Writing
Custom Judges:
Create custom judges that combine multiple evaluators - use https://scorable.rootsignals.ai/ to generate a judge.
Judges provide aggregated scoring across multiple criteria
Root Signals enables several key workflows that transform how organizations measure, optimize, and control their AI applications. These flows represent common patterns for leveraging the platform's capabilities to achieve concrete outcomes.
In this flow, we transform a description of the workflow or measurement problem into a judge, consisting of a concrete set of evaluators that precisely measure success. The process involves:
Success Criteria Definition: Start with your business problem or use case description, and identify which dimensions of success matter for your specific context
Evaluator Selection: Map success criteria to specific evaluators from the Root Signals portfolio or create custom ones
Evaluator Construction: Create custom evaluators for key measurement targets
Judge Assembly: Combine selected evaluators into a coherent measurement strategy
Example: For a customer service chatbot, the problem "a chatbot for which we must ensure helpful and accurate responses" might decompose into:
Relevance evaluator (responses address the customer's question)
Completeness evaluator (all aspects of queries are addressed)
Politeness evaluator (maintaining professional tone)
Policy adherence evaluator (following company guidelines)
Evaluator-Driven Improvement of Prompts and Models for Operational Prompts
Given a set of evaluators, this flow systematically improves your AI application's performance:
Baseline Measurement: Evaluate current prompts and models against the evaluators
Variation Testing: Test different prompts, models, and configurations
Optimal Performance Selection: Choose the configuration that best balances evaluator scores against costs and latencies (see the sketch after the considerations below)
Key considerations:
Balance accuracy improvements against cost increases
Consider latency requirements for real-time applications
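A minimal sketch of baseline measurement and variation testing, assuming the Conciseness Root evaluator; the candidate prompts, ticket text, and generation model are illustrative placeholders:
# Sketch: score two candidate prompts with one evaluator and keep the better one.
from openai import OpenAI
from root import RootSignals

llm = OpenAI()
rs = RootSignals()

candidate_prompts = [
    "Summarize the following support ticket in two sentences: {ticket}",
    "Write a brief, plain-language summary of this support ticket: {ticket}",
]
ticket = "Customer reports that exported CSV files are missing the header row."

scores = []
for prompt in candidate_prompts:
    completion = llm.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt.format(ticket=ticket)}],
    )
    output = completion.choices[0].message.content
    scores.append(rs.evaluators.Conciseness(response=output).score)

best_prompt = candidate_prompts[scores.index(max(scores))]
print(f"Best prompt by Conciseness: {best_prompt!r} (scores: {scores})")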
Calibration Data-Driven Improvement of Predicates and Models for Evaluators
Given a calibration dataset, this flow systematically improves the performance of individual evaluators:
Baseline Measurement: Evaluate the current predicate and model against the calibration dataset
Variation Testing: Test different predicates, models, and configurations
Optimal Performance Selection: Choose the configuration that best balances calibration scores against costs and latencies
Key considerations:
Balance accuracy improvements against cost increases.
Consider latency requirements for real-time applications; note that some workflows (email, offline agent operations) are not latency-sensitive
Transform Existing Data into Actionable Insights
This flow applies evaluators to existing datasets or LLM input-output telemetry, enabling data quality assessment and filtering:
Data Ingestion: Load transcripts, chat logs, or other text data
Evaluator Application: Score each data point across the multiple evaluation dimensions
Metadata Enrichment: Attach scores as searchable metadata
Filtering and Analysis: Identify high/low-quality samples, policy violations, or improvement opportunities (a sketch follows the applications below)
Applications:
Call center transcript analysis (clarity, policy alignment, customer satisfaction indicators)
Training data curation (identifying high-quality examples)
Compliance monitoring (detecting policy violations)
Quality assurance sampling (focusing review on problematic cases)
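A minimal sketch of this flow, assuming a list of call-center transcript strings and the Clarity Root evaluator; the transcripts and review threshold are illustrative:
# Sketch: enrich existing transcripts with evaluator scores, then filter.
from root import RootSignals

client = RootSignals()

transcripts = [
    "Agent: Thanks for calling. Customer: I need to update my billing address ...",
    "Agent: Uh, yeah, so, like, maybe try turning it off or something ...",
]

enriched = []
for text in transcripts:
    result = client.evaluators.Clarity(response=text)
    enriched.append({"text": text, "clarity": result.score})

# Focus human review on the weakest samples (threshold is illustrative)
needs_review = [row for row in enriched if row["clarity"] < 0.4]
print(f"{len(needs_review)} of {len(enriched)} transcripts flagged for review")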
This flow creates a feedback loop that automatically improves content based on evaluation results:
Initial Evaluation: Score the original content with relevant evaluators
Feedback Generation: Extract scores and justifications from evaluators
Improvement Execution:
For LLM-generated content: Re-prompt the original model with evaluation feedback
For existing content: Pass to any LLM with improvement instructions based on evaluator feedback
Verification: Re-evaluate to confirm improvements
Use cases:
Iterative response refinement in production
Batch improvement of historical data
Automated content enhancement pipelines
Self-improving AI systems
This flow implements safety and quality controls by preventing substandard LLM outputs from reaching users:
Threshold Definition: Set minimum acceptable scores for critical evaluators
Real-Time Evaluation: Score LLM outputs before delivery
Conditional Blocking: Prevent responses that fall below thresholds from being served (see the sketch after the lists below)
Fallback Handling: Trigger alternative responses or escalation procedures for blocked content
Implementation strategies:
Critical evaluators: Harmlessness, confidentiality, policy adherence
Quality thresholds: Minimum coherence, relevance, or completeness scores
Graceful degradation: Provide safe default responses when blocking occurs
Logging and alerting: Track blocked responses for system improvement
Applications:
Customer-facing chatbots requiring brand safety
Healthcare AI with strict accuracy requirements
Financial services with regulatory compliance needs
Educational tools requiring age-appropriate content
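A minimal sketch of threshold-based blocking, assuming the Harmlessness Root evaluator; the threshold and fallback message are illustrative:
# Sketch: block responses that fall below a safety threshold before serving them.
from root import RootSignals

client = RootSignals()

SAFE_FALLBACK = "I'm sorry, I can't help with that. Please contact our support team."
THRESHOLD = 0.8  # illustrative; tune per evaluator and use case

def guarded_reply(candidate_response: str) -> str:
    result = client.evaluators.Harmlessness(response=candidate_response)
    if result.score < THRESHOLD:
        # Log the blocked response for later system improvement, then degrade gracefully
        print(f"Blocked response (score={result.score}, log={result.execution_log_id})")
        return SAFE_FALLBACK
    return candidate_response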
Zero-Impact Monitoring of LLM Traffic
This flow enables comprehensive observability without affecting application performance:
Proxy Configuration: Route LLM traffic through Root Signals proxy
Automatic Capture: All requests and responses logged transparently
Asynchronous Processing of Evaluations: Evaluations occur out-of-band
Dashboard Visibility: Real-time metrics
Benefits:
No code changes required in application, only base_url update
Automatic request/response pairing
Built-in retry and error handling
Centralized configuration management
Alternatively, log directly from your application:
Asynchronous Logging: Send request/response pairs to the Root Signals API
Non-Blocking Implementation: Use a fire-and-forget pattern or background queues (see the sketch after the considerations below)
Batching Strategy: Aggregate logs for efficient transmission
Resilient Design: Handle logging failures without affecting main flow
Benefits:
Full control over what gets logged
No network topology changes
Custom metadata enrichment
Selective logging based on business logic
Key considerations for both approaches:
Zero latency addition: Logging happens asynchronously
High-volume support: Handles production-scale traffic
Cost optimization: Sample high-volume, low-risk traffic
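A minimal sketch of the non-blocking, fire-and-forget pattern, running the Relevance Root evaluator out-of-band on a background thread so the main response path is never delayed; the worker count is illustrative:
# Sketch: evaluate request/response pairs off the critical path.
from concurrent.futures import ThreadPoolExecutor
from root import RootSignals

client = RootSignals()
executor = ThreadPoolExecutor(max_workers=4)  # illustrative pool size

def _evaluate(request: str, response: str) -> None:
    # Runs on a background thread; failures here must not affect the main flow
    try:
        client.evaluators.Relevance(request=request, response=response)
    except Exception:
        pass  # resilient design: never let evaluation errors propagate

def serve(request: str, response: str) -> str:
    # Fire-and-forget: submit the evaluation and return immediately
    executor.submit(_evaluate, request, response)
    return response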
Agentic RAG with Root Signals Relevance Judge
A replication of the Agentic RAG tutorial from LangGraph, where the decision of whether or not to use the retrieved content to answer a question is powered by Root Signals Evaluators.
The following is from LangGraph docs:
%%capture --no-stderr
%pip install -U --quiet langchain-community tiktoken langchain-openai langchainhub chromadb langchain langgraph langchain-text-splitters
import getpass
import os
def _set_env(key: str):
if key not in os.environ:
os.environ[key] = getpass.getpass(f"{key}:")
_set_env("OPENAI_API_KEY")
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from typing import Annotated, Sequence, Literal
from typing_extensions import TypedDict
from langchain_core.messages import BaseMessage
from langgraph.graph.message import add_messages
from langchain import hub
from langchain_core.messages import BaseMessage, HumanMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field
from langgraph.prebuilt import tools_condition
from langchain.tools.retriever import create_retriever_tool
from langgraph.graph import END, StateGraph, START
from langgraph.prebuilt import ToolNode
import pprint
urls = [
"https://www.rootsignals.ai/post/evalops",
"https://www.rootsignals.ai/post/llm-as-a-judge-vs-human-evaluation",
"https://www.rootsignals.ai/post/root-signals-bulletin-january-2025",
]
docs = [WebBaseLoader(url).load() for url in urls]
docs_list = [item for sublist in docs for item in sublist]
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
chunk_size=100, chunk_overlap=50
)
doc_splits = text_splitter.split_documents(docs_list)
# Add to vectorDB
vectorstore = Chroma.from_documents(
documents=doc_splits,
collection_name="rag-chroma",
embedding=OpenAIEmbeddings(),
)
retriever = vectorstore.as_retriever()
retriever_tool = create_retriever_tool(
retriever,
"retrieve_blog_posts",
"Search and return information about Root Signals blog posts on LLM evaluation.",
)
tools = [retriever_tool]
class AgentState(TypedDict):
# The add_messages function defines how an update should be processed
# Default is to replace. add_messages says "append"
messages: Annotated[Sequence[BaseMessage], add_messages]
### Nodes
def agent(state):
"""
Invokes the agent model to generate a response based on the current state. Given
the question, it will decide to retrieve using the retriever tool, or simply end.
Args:
state (messages): The current state
Returns:
dict: The updated state with the agent response appended to messages
"""
print("---CALL AGENT---")
messages = state["messages"]
model = ChatOpenAI(temperature=0, streaming=True, model="gpt-4-turbo")
model = model.bind_tools(tools)
response = model.invoke(messages)
# We return a list, because this will get added to the existing list
return {"messages": [response]}
def rewrite(state):
"""
Transform the query to produce a better question.
Args:
state (messages): The current state
Returns:
dict: The updated state with re-phrased question
"""
print("---TRANSFORM QUERY---")
messages = state["messages"]
question = messages[0].content
msg = [
HumanMessage(
content=f""" \n
Look at the input and try to reason about the underlying semantic intent / meaning. \n
Here is the initial question:
\n ------- \n
{question}
\n ------- \n
Formulate an improved question: """,
)
]
# Grader
model = ChatOpenAI(temperature=0, model="gpt-4-0125-preview", streaming=True)
response = model.invoke(msg)
return {"messages": [response]}
def generate(state):
"""
Generate answer
Args:
state (messages): The current state
Returns:
dict: The updated state with re-phrased question
"""
print("---GENERATE---")
messages = state["messages"]
question = messages[0].content
last_message = messages[-1]
docs = last_message.content
# Prompt
prompt = hub.pull("rlm/rag-prompt")
# LLM
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0, streaming=True)
# Post-processing
def format_docs(docs):
return "\n\n".join(doc.page_content for doc in docs)
# Chain
rag_chain = prompt | llm | StrOutputParser()
# Run
response = rag_chain.invoke({"context": docs, "question": question})
return {"messages": [response]}
print("*" * 20 + "Prompt[rlm/rag-prompt]" + "*" * 20)
prompt = hub.pull("rlm/rag-prompt").pretty_print() # Show what the prompt looks like
Define the Decision-maker as a Root Judge
Now we define the Root Signals Relevance evaluator as the decision maker for whether the answer should come from the retrieved docs or not. The advantages of using Root Signals (as opposed to the original LangGraph method) are:
We can control the relevance threshold because Root Signals evaluators always return a normalized score between 0 and 1.
If we want, we can incorporate the Justification in the decision-making process.
The code is much shorter, i.e. about ⅓ of that of the LangGraph tutorial.
from root import RootSignals
client = RootSignals()
def grade_relevance(state) -> Literal["generate", "rewrite"]:
"""
Determines whether the retrieved documents are relevant to the question.
Args:
state (messages): The current state
Returns:
str: A decision for whether the documents are relevant or not
"""
messages = state["messages"]
question = messages[0].content
docs = messages[-1].content
result = client.evaluators.Relevance(
request=question,
response=docs,
)
if result.score > 0.5: # we can control the threshold
return "generate"
else:
return "rewrite"
The rest of the tutorial is still from LangGraph:
# Define a new graph
workflow = StateGraph(AgentState)
# Define the nodes we will cycle between
workflow.add_node("agent", agent) # agent
retrieve = ToolNode([retriever_tool])
workflow.add_node("retrieve", retrieve) # retrieval
workflow.add_node("rewrite", rewrite) # Re-writing the question
workflow.add_node(
"generate", generate
) # Generating a response after we know the documents are relevant
# Call agent node to decide to retrieve or not
workflow.add_edge(START, "agent")
# Decide whether to retrieve
workflow.add_conditional_edges(
"agent",
# Assess agent decision
tools_condition,
{
# Translate the condition outputs to nodes in our graph
"tools": "retrieve",
END: END,
},
)
# Edges taken after the `action` node is called.
workflow.add_conditional_edges(
"retrieve",
# Assess agent decision
grade_relevance, # this is Root Signals evaluator
)
workflow.add_edge("generate", END)
workflow.add_edge("rewrite", "agent")
# Compile
graph = workflow.compile()
Our RAG Agent is ready:
inputs = {
"messages": [
("user", "What is EvalOps?"),
]
}
for output in graph.stream(inputs):
for key, value in output.items():
pprint.pprint(f"Output from node '{key}':")
pprint.pprint("---")
pprint.pprint(value, indent=2, width=80, depth=None)
pprint.pprint("\n---\n")
An evaluator is a metric for a piece of text that maps a string originating from a language model to a numeric value between 0 and 1. For example, an evaluator could measure the "Truthfulness" of the generated text.
Root Signals provides a rich collection of evaluators that you can use, such as:
Quality of professional writing: checks how grammatically correct, clear, concise and precise the output is
Completeness: evaluates how well the response addresses all aspects of the input request
Toxicity Detection: Identifies any toxic or inappropriate content
Faithfulness: Verifies the faithfulness of response with respect to a given context, acting as a hallucination detection, e.g. in RAG settings
Sentiment Analysis: Determines the overall sentiment (positive, negative, or neutral)
You can also define your own custom evaluators.
The objective of an evaluator consists of two components:
Intent: This describes the purpose and goal of the evaluator, specifying what it aims to evaluate or assess in the response.
Calibrator: It provides the ground truth set of appropriate numeric values for specific request-response pairs that defines the intended behavior of the evaluator. This set 'calibrates' its evaluation criteria and ensures consistent and accurate assessments.
The function of an evaluator consists of three components:
Prompt
Demonstrations
Model
The prompt (or instruction) defines the instructions and variable content the evaluator prompts a large language model with. It should clearly specify the criteria and guidelines for assessing the quality and performance of responses.
Note: During execution, the prompt defined by the user is appended to a more general template containing instructions responsible for guiding and optimizing the behavior of the evaluator. Thus the user does not have to bother with generic instructions such as "Give a score between 0 and 1". It is sufficient to describe the evaluation criteria of the specific evaluator at hand.
Example: How well does the {{response}} adhere to instructions given in {{request}}.
All variable types are available for an evaluator. However, some restrictions apply.
The prompt of an evaluator must contain a special variable named response that represents the LLM output to be evaluated. It can also contain a special variable named request if the prompt that produced the input is considered relevant for evaluation. Both request and response can be either input or reference variables. In the latter case, the variable is associated with a dataset that can be searched for contextual information to support the evaluation, using Retrieval Augmented Generation.
A demonstration is a sample consisting of a request-response pair (or just a response, if the request is not considered necessary for evaluation), an expected score, and an optional justification. Demonstrations exemplify the expected behavior of the evaluator. Demonstrations are provided to the model and must therefore be strictly separated from calibration samples.
A justification illustrates the rationale for the given score. Justification can be helpful when the reason for a specific score is not obvious, allowing the model to pay attention to relevant aspects of the evaluated response and tackle ambiguous cases in a nuanced way.
Example:
A sample demonstration for an evaluator for determining if content is safe for children.
The model refers to the specific language model or engine used to execute the evaluator. It should be chosen based on its capabilities and suitability for the evaluation task.
Calibration is the response to the naturally arising question: How can we trust evaluation results? The calibrator provides a way to quantify the performance of the evaluator by providing the ground truth against which the evaluator can be gauged. The reference dataset that forms the calibrator defines the expected behaviour of the evaluator.
The samples of the calibration dataset are similar to those of the demonstration dataset, consisting of a score, a response, and optionally a request and a justification.
On the Calibrator page:
The calibration dataset can be imported from a file or typed in the editor.
A synthetic dataset can be generated, edited, and appended.
The Calibration page enables comparing the actual performance of the evaluator on the samples of the calibration dataset with the expected performance defined by the scores of the set.
On this page:
Total deviance, which quantifies the average magnitude of the errors between predicted values by an evaluator and the actual observed values, can be calculated. A lower total deviance indicates that the evaluator's predictions are closer to the actual outcomes, which signifies better performance of the evaluator. The total deviance is computed using the Root Mean Square method.
Deviations for individual samples of the dataset are displayed, enabling easy identification of weak points of the evaluator. If a particular sample has a high deviation, there are characteristics in the sample that confuse the evaluator.
To improve the performance or 'calibrate' an evaluator, adjustments can be made to one or more of the three key components: the prompt, the demonstrations, and the model.
Effective strategies for this can be deduced by examining the calibration results. Inspecting the worst-performing samples, those with the largest deviations, can help identify the evaluator's weak points.
Then, one or more steps can be taken:
The instructions given in the prompt can be made more specific to adjust the behavior in the problem cases.
Modify demonstration content by adding examples similar to the problematic samples, which can enhance performance in these areas. Additional instructions can be added by including a justification to a demonstration. Note: Maintaining a sufficiently large calibration dataset reduces the risk of overfitting, i.e., producing an evaluator tailored to the calibration but lacking generalization.
The model can be changed. Overall performance can be improved by using a larger or otherwise better suited model, often at the cost of evaluation latency and price.
After each modification, it's advisable to recalculate the deviations to assess the direction and magnitude of the impact on performance.
As evaluators are a special type of skill, the concepts that apply to skills apply to evaluator skills too.
Evaluators tagged with RAG Evaluator work properly when evaluating skills with reference variables. Alternatively, when not used to evaluate skill outputs, a contexts parameter containing a set of documents as a list of strings, corresponding to the retrieved context data, must be passed.
Evaluators tagged with Ground Truth Evaluator can be used for evaluating test sets that contain an expected_output column. When used through the SDK, an expected_output parameter must likewise be passed.
Evaluators tagged with Function Call Evaluator can be used through the SDK and require a functions parameter, conforming to the OpenAI-compatible tools parameter, to be passed.
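As a sketch of how these parameters are passed through the Python SDK (the evaluator IDs below are placeholders and the payloads are illustrative; see the SDK reference for details):
from root import RootSignals

client = RootSignals()

# RAG Evaluator: pass the retrieved documents as a list of strings via `contexts`.
client.evaluators.run(
    request="What is our refund policy?",
    response="Refunds are available within 30 days of purchase.",
    contexts=[
        "Refund policy: purchases can be refunded within 30 days.",
        "Shipping policy: orders ship within 2 business days.",
    ],
    evaluator_id="<faithfulness-evaluator-id>",  # placeholder
)

# Ground Truth Evaluator: pass the expected answer via `expected_output`.
client.evaluators.run(
    response="Refunds are available within 30 days of purchase.",
    expected_output="Purchases can be refunded within 30 days.",
    evaluator_id="<answer-correctness-evaluator-id>",  # placeholder
)

# Function Call Evaluator: pass an OpenAI-compatible tools definition via `functions`.
client.evaluators.run(
    request="Check the weather in Helsinki.",
    response='{"name": "get_weather", "arguments": {"city": "Helsinki"}}',
    functions=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    evaluator_id="<json-content-accuracy-evaluator-id>",  # placeholder
)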
Relevance Assesses the relevance of the response in relation to the request by evaluating accuracy, completeness, adherence to the prompt, and logical consistency, to determine whether responses remain directly on-topic and informative.
Safety for Children Checks the appropriateness of content for young audiences, focusing on avoiding language or themes that could be harmful or unsuitable for children, thus promoting safety and age-appropriateness.
Sentiment Recognition Identifies the emotional tone of the response, determining whether it is predominantly positive, negative, or neutral, which helps in gauging the overall sentiment conveyed by the content.
Coherence Assesses whether the response is logically consistent and contextually relevant, checking for smooth transitions between ideas and the maintenance of topic focus throughout the response.
Conciseness Measures the brevity and directness of the response, aiming to convey information efficiently without the use of unnecessary words, thus avoiding redundancy.
Engagingness Evaluates the ability of the response to engage readers, determining whether the content is interesting and likely to hold attention, or if it tends toward being boring or predictable.
Originality Checks the originality and creativity of the response, focusing on providing unique and inventive perspectives rather than relying on clichéd or derivative ideas.
Clarity Measures how easily the response can be understood on a first reading, aiming for ideas to be presented in a logical, straightforward manner with sufficient detail for clarity.
Precision Assesses the accuracy and specificity of the response, aiming for responses that are detailed and exact, with no room for ambiguities or overly general statements.
Completeness Evaluates how well the response addresses all aspects of the input request, ensuring that no important elements are overlooked and that comprehensive coverage is provided for multi-faceted queries or instructions.
Persuasiveness Evaluates the persuasiveness of the response by examining the strength, clarity, and convincingness of the main argument or point, including how well it is supported by evidence, examples, and logical reasoning.
Confidentiality Assesses the response for the presence of any personal details, private communications, sensitive business data, or any other information that should be kept confidential, focusing on maintaining privacy and security.
Harmlessness Assesses the harmlessness of the response by checking for the presence of hate speech, slurs, derogatory language, and any incitement or glorification of violence. It evaluates the overall potential of the content to cause harm or distress.
Formality Evaluates the formality of the response by considering factors such as word choice, sentence structure, tone, grammar, and overall style. This helps in matching the content to the expected level of formality for the context.
Politeness Assesses the politeness of the response by examining factors such as word choice, tone, phrasing, and the overall level of respect and courtesy demonstrated in the response.
Helpfulness Evaluates the helpfulness of the response by considering how useful, informative, and beneficial the text is to a reader seeking information. Helpful text provides clear, accurate, relevant, and comprehensive information to aid the reader's understanding and ability to take appropriate action.
Non-toxicity Assesses the non-toxicity of the response. Text that is benign and completely harmless receives high scores.
Faithfulness RAG Evaluator This corresponds to hallucination detection in RAG settings. Measures the factual consistency of the generated answer with respect to the context. It determines whether the response accurately reflects the information provided in the context. This is the high-accuracy variant of our set of Faithfulness evaluators.
Faithfulness-swift RAG Evaluator This is the faster variant of our set of Faithfulness evaluators.
Answer Relevance Measures how relevant a response is with respect to the prompt/query. Completeness and conciseness of the response are considered.
Truthfulness RAG Evaluator Assesses factual accuracy by prioritizing context-backed claims over model knowledge, while preserving partial validity for logically consistent but unverifiable claims. Unlike Faithfulness, allows for valid model-sourced information beyond the context. This is the high-accuracy variant of our set of Truthfulness evaluators.
Truthfulness-swift RAG Evaluator This is the faster variant of our set of Truthfulness evaluators.
Quality of Writing - Professional Measures the quality of writing as a piece of academic or other professional text. It evaluates the formality, correctness, and appropriateness of the writing style, aiming to match professional standards.
Quality of Writing - Creative Measures the quality of writing as a piece of creative text. It evaluates the creativity, expressiveness, and originality of the content, focusing on its impact and artistic expression.
JSON Content Accuracy RAG Evaluator | Function Call Evaluator Checks if the content of the JSON response is accurate and matches the documents and instructions, verifying that the JSON data correctly represents the intended information.
JSON Property Completeness Function Call Evaluator Checks how many of the required properties are present in the JSON response, verifying that all necessary fields are included. This is a string (non-LLM) evaluator.
JSON Property Type Accuracy Function Call Evaluator Checks if the types of properties in the JSON response match the expected types, verifying that the data types are correct and consistent. This is a string (non-LLM) evaluator.
JSON Property Name Accuracy Function Call Evaluator Checks if the names of properties in the JSON response match the expected names, verifying that the field names are correct and standardized. This is a string (non-LLM) evaluator.
JSON Empty Values Ratio Function Call Evaluator Checks the portion of empty values in the JSON response, aiming to minimize missing information and ensure data completeness. This is a string (non-LLM) evaluator.
Answer Semantic Similarity Ground Truth Evaluator Measures the semantic similarity between the generated answer and the ground truth, helping to evaluate how well the response mirrors the expected response.
Answer Correctness Ground Truth Evaluator Measures the factual correspondence of the generated response against a user-supplied ground truth. It considers both semantic similarity and factual consistency.
Context Recall RAG Evaluator | Ground Truth Evaluator Measures whether the retrieved context provides sufficient information to produce the ground truth response, evaluating if the context is relevant and comprehensive according to the expected output.
Context Precision RAG Evaluator | Ground Truth Evaluator Measures the relevance of the retrieved contexts to the expected output.
Summarization Quality Measures the quality of text summarization with high weights for clarity, conciseness, precision, and completeness.
Translation Quality Quality of machine translation with high weights for accuracy, completeness, fluency, and cultural appropriateness.
Planning Efficiency Quality of planning of an AI agent with high weights for efficiency, effectiveness, and goal-orientation.
Information Density Information density of a response with high weights for concise, factual statements and penalizing vagueness, questions, or evasive answers.
Reading Ease Evaluates the text for ease of reading, focusing on simple language, clear sentence structures, and overall clarity.
Answer Willingness Answer willingness of a response with high weights for response presence, directness and penalty for response avoidance, refusal, or evasion.
As our evaluators are LLM judges, they are non-deterministic, i.e., the same input can result in slightly different scores. We try to keep this fluctuation low. The expected standard deviations of each evaluator are reported below along three dimensions: short vs. long context, single-turn vs. multi-turn, and low vs. high ground truth score:
Both ready-made Root Evaluators and your Custom Evaluators have version control. Normally, you can call an evaluator as:
client.evaluators.run(
    request="My internet is not working.",
    response="""
    I'm sorry to hear that your internet isn't working.
    Let's troubleshoot this step by step. What is your IP address?
    """,
    evaluator_id="bd789257-f458-4e9e-8ce9-fa6e86dc3fb9",  # e.g. corresponding to Relevance
)
and if you want to call a specific version, you can add:
evaluator_version_id="7c099204-4a41-4d56-b162-55aac24f6a47"
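Put together, a version-pinned call could look like the following sketch (combining the two snippets above; check the SDK reference for the exact signature):
client.evaluators.run(
    request="My internet is not working.",
    response="I'm sorry to hear that. What is your IP address?",
    evaluator_id="bd789257-f458-4e9e-8ce9-fa6e86dc3fb9",
    evaluator_version_id="7c099204-4a41-4d56-b162-55aac24f6a47",  # pin a specific evaluator version
)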
Coming in August 2025 - meanwhile, contact [email protected] for the full compliance & security documentation.
Documentation merging in progress - meanwhile, contact [email protected] for the full documentation
Root Signals builds in the open, with a philosophy of transparency and multiple open-source projects. This roadmap is a living document about what we're working on and what's next. Root Signals is the world's most principled and powerful system for measuring the behavior of LLM-based applications, agents, and workflows.
Scorable is the automated LLM Evaluation Engineer agent for co-managing this platform with you.
Our vision is to create and auto-optimize the strongest automated knowledge process evaluation stack possible, with the least amount of effort and information required from the user.
Maximum Automated Information Extraction
From user intent and/or provided example/instruction data, extract as much relevant information as possible.
Awareness of the information quality
Engage the user with the smallest amount of maximally impactful questions.
Maximally Powerful Evaluation Stack Generation
Build the most comprehensive and accurate evaluation capabilities possible, within the confines of data available.
Built for Agents
Maximum compatibility with autonomous agents and workflows.
Maximum Integration Surface
Seamless integration with all key AI frameworks.
EvalOps Principles for Long Term
Follow Root EvalOps Principles for evaluator lifecycle management.
✅ Automated Policy Adherence Judges
Create judges from uploaded policy documents and intents
✅ GDPR awareness of models (link)
Ability to filter out models not complying with GDPR
✅ Evaluator Calibration Data Synthesizer v1.0 (link)
In the evaluator drill-in view, expand your calibration dataset from 1 or more examples
✅ Evaluator version history and control to include all native Root Evaluators (link)
✅ Evaluator determinism benchmarks and standard deviations in reference datasets (link)
✅ Agent Evaluation MCP: stdio & SSE versions (link)
✅ Root Judge LLM 70B judge available for download and running in Root Signals for free!
Public Evaluation Reports
Generate HTML reports from any judge execution
TypeScript SDK
Rehashing of Example-driven Evaluation
Smoothly create the full judge from examples
Native Speech Evaluator API
Upload or stream audio directly to evaluators
Unified Experiments framework to Replace Skill Tests
Command Line Interface
Advanced Judge visibility controls
RBAC coverage on Judges (as in Evaluators, Skills and Datasets)
Output Refinement At-Origin
Refine your LLM outputs automatically based on scores
Agentic Classifier Generation 2.0
Create classifiers with the same robustness as metric evaluator stacks
Automatic Context Engineering
Refine your prompt templates automatically based on scores
Support all RAG evaluators
Improved Playground View
Agent Evaluation Pack 2.0
(Root Evaluator list expanding every 1-2 weeks, stay tuned)
Full OpenTelemetry Support
LiteLLM Direct Support
OpenRouter Support
(more coming)
Sync Judge & Evaluator Definitions to GitHub
Community Evals
Self-Hostable Evaluation Executor
Remote MCP Server
MCP Feature Extension Pack
Full judge feature access
Full log insights access
Reasoner-specific model parameters (incl. budget) in evaluators
(model support list continuously expanded, stay tuned)
More Planned Features coming as we sync our changelogs and the rest of the internal roadmap contents!
🐛 Bug Reports: GitHub Issues
📧 Enterprise Features: Contact [email protected]
💡 General: Discord
Last updated: 2025-06-30
Coming Soon!
To unlock full functionality, create a custom component that wraps the Root Signals skill and supports Root Signals Validators:
from typing import Dict
from typing import List
from haystack import component
from root import RootSignals
from root.validators import Validator
@component
class RootSignalsGenerator:
"""
Component to enable skill use
"""
def __init__(self, name: str, intent: str, prompt: str, model: str, validators: List[Validator]):
self.client = RootSignals()
self.skill = self.client.skills.create(
name=name,
intent=intent,
prompt=prompt,
model=model,
validators=validators,
)
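Note that the component above only defines the constructor. For the pipeline below it also needs a run method that executes the skill and returns its result under replies. A minimal sketch, assuming the skill object returned by skills.create can be executed with a run(...) call that takes the prompt template variables (check the SDK reference for the exact signature); it uses the ChatMessage and SkillExecutionResult imports introduced further down:
    @component.output_types(replies=SkillExecutionResult)
    def run(self, messages: List[ChatMessage]):
        # Assumption: the created skill is executed with its template variables;
        # here the latest chat message fills the {{question}} variable.
        result = self.skill.run({"question": messages[-1].content})
        return {"replies": result}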
For convenience, let's create another component to parse the validation results:
from root.generated.openapi_client.models.skill_execution_result import SkillExecutionResult
@component
class RootSignalsValidationResultParser:
    @component.output_types(passed=bool)
    def run(self, replies: SkillExecutionResult):
        # Expose only the overall validation verdict to downstream components
        return {"passed": replies.validation["is_valid"]}
We are now equipped to replace any OpenAI-compatible generator with a validated one, based on the RootSignalsGenerator component.
from haystack.dataclasses import ChatMessage
from haystack.core.pipeline.pipeline import Pipeline
from haystack.components.builders.dynamic_chat_prompt_builder import DynamicChatPromptBuilder
generator_A = RootSignalsGenerator(
name="My Q&A chatbot",
intent="Simple Q&A chatbot",
prompt="Provide a clear answer to the question: {{question}}",
model="gpt-4o",
validators=[Validator(evaluator_name="Clarity", threshold=0.6)]
)
pipeline = Pipeline(max_loops_allowed=1)
pipeline.add_component("prompt_builder", DynamicChatPromptBuilder())
pipeline.add_component("generator_A", generator_A)
pipeline.add_component("validation_parser", RootSignalsValidationResultParser())
pipeline.connect("prompt_builder.prompt", "generator_A.messages")
pipeline.connect("generator_A.replies", "validation_parser.replies")
prompt_template = """
Answer the question below.
Question: {{question}}
"""
result = pipeline.run(
{
"prompt_builder": {
"prompt_source": [ChatMessage.from_user(prompt_template)],
"template_variables": {
"question": "In the field of software development, what is the meaning and significance of 'containerization'? Use a popular technology as example. Cite sources where available."
},
}
},
include_outputs_from={
"generator_A",
"validation_parser",
},
)
{
'validation_parser': {'passed': True}, # use this directly, i.e. for haystack routers
'generator_A': { # full response from the generator, use llm_output for the plain response
'replies': SkillExecutionResult(
llm_output='Containerization in software development refers to the practice of encapsulating an application and its dependencies into a "container" that can run consistently across different computing environments. This approach ensures that the software behaves the same regardless of where it is deployed <truncated> \nSources:\n- Docker. "What is a Container?" Docker, https://www.docker.com/resources/what-container.\n- Red Hat. "What is containerization?" Red Hat, https://www.redhat.com/en/topics/containers/what-is-containerization.',
validation={'is_valid': True, 'validator_results': [{'evaluator_name': 'Clarity', 'evaluator_id': '603eae60-790b-4215-b6d3-301c16fc37c5', 'result': 0.85, 'threshold': 0.6, 'cost': 0.006645000000000001, 'is_valid': True, 'status': 'finished'}]},
model='gpt-4o',
execution_log_id='1fbdd6fc-f5a7-4e30-a7dc-15549b7557ec',
rendered_prompt="Provide a clear answer to the question: Answer the question below.\n \n Question: In the field of software development, what is the meaning and significance of 'containerization'? Use a popular technology as example. Cite sources where available.",
cost=0.003835)
}
}
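The comment above mentions Haystack routers; alternatively, you can simply branch on the verdict in plain Python after the pipeline run. A minimal sketch using the result dictionary shown above:
# Use the validator verdict to decide whether to surface the answer.
if result["validation_parser"]["passed"]:
    answer = result["generator_A"]["replies"].llm_output
else:
    answer = "Sorry, I could not produce a sufficiently clear answer."
print(answer)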
Evaluation has been critical in all Machine Learning/AI systems for decades, but it was not a problem until generative AI: back then, we had our test sets with ground truth annotations and simply calculated proportions or ratios (e.g. accuracy, precision, recall, AUROC, F1-score) or well-defined metric formulas (e.g. mean absolute error, or some custom metric) to estimate the performance of our systems. If we were satisfied with the accuracy and latency, we deployed our AI models to production.
We cannot do that anymore, because LLMs
output free text instead of pre-defined categories or numerical values
are non-deterministic
are instructable by semantic guidance; in other words, they have a prompt, and their behaviour depends on it in ways that are difficult to predict beforehand.
Therefore, applications powered by LLMs are inherently unpredictable, unreliable, weird, and in general hard to control.
This is the main blocker of large scale adoption and value creation with Generative AI. To overcome this, we need a new way of measuring, monitoring, and guardrailing AI systems.
Yes, there are numerous LLM benchmarks and leaderboards, yet
They measure LLMs, not LLM applications. Benchmarks are interested in low-level academic metrics that are far away from business goals.
Tasks and samples in those benchmarks do not reflect real-life use cases. For example, multiple choice high school geometry question answering performance is not relevant when one is developing a customer support chatbot that should not hallucinate.
Benchmarks are full of low-quality, incomplete, ambiguous, and erroneous samples.
Data leakage is rampant. Consciously or not, test samples or slight variations of them are often leaked into the training data.
Benchmarks are not always transparent about what kind of settings they use in the experiments (e.g. temperature, zero-shot or few-shot, prompts) and are hard to replicate.
In short,
You want to measure and monitor your specific LLM-powered automation, not the generic academic capabilities of an LLM.
Datasets in Root Signals contain static information that can be included as context for skill execution. They allow you to provide additional data to your skills, such as information about your organization, products, customers, or any other relevant domain knowledge.
By leveraging data sets, you can enhance the capabilities of your skills and provide them with relevant domain knowledge or test data to ensure their performance and accuracy.
See SDK documentation.
To import a new data set:
Navigate to the Data Sets view.
Click the "Import Data Set" button on the top right corner of the screen.
Enter a name for your data set. If no name is provided, the file name will be used as the data set name.
Choose the data set type:
Reference Data: Used for skills that require additional context.
Test Data: Used for defining test cases and validating skill or evaluator performance.
Select a tag for the data set or create a new one.
Either upload a file or provide a URL from which the system can retrieve the data.
Preview the data set by clicking the "Preview" button on the bottom right corner.
Save the data set by clicking the "Submit" button.
Data sets can be linked to skills using reference variables. When defining a skill, you can choose a data set as a reference variable, and the skill will have access to that data set during execution. This allows you to provide additional context or information to the skill based on the selected data set.
When creating a new skill or an evaluator, you can select a test data set or a calibration data set, respectively, to drive the skill or evaluator with multiple predefined sequential inputs for performance evaluation.
Root Signals allows you to test your skill against multiple models simultaneously. In the "Prompts" and "Models" sections of the skill creation form, you can add multiple prompt variants and select one or more models to be tested. By clicking the "Test" / "Calibrate" button in the bottom-right corner, the system runs your selected test data set against each of the chosen prompts and models. This enables you to compare their performance and select the combination with the best trade-offs for your use case.