Objectives consist of a human-readable Intent and ground truth examples. An objective serves two purposes:
Communication: Expressing the intended business purpose of the evaluator
Coordination: Serving as a battery of measures
Advanced use cases and common recipes
Root Signals is a measurement, observability, and control platform for GenAI applications, automations, and agentic workflows powered by Large Language Models (LLMs). Such applications include chatbots, Retrieval Augmented Generation (RAG) systems, agents, data extractors, summarizers, translators, AI assistants, and various automations powered by LLMs.
Any developer can use Root Signals to:
Add appropriate metrics such as Truthfulness, Answer Relevance, or Coherence of responses to any LLM pipeline and optimize their design choices (which LLM to use, prompt, RAG hyper-parameters, etc.) using these measurements:
Log, record, and compare their changes and the measurements corresponding to those changes
Integrate metrics into CI/CD pipelines (e.g. GitHub actions) to prevent regressions
Turn those metrics into guardrails that prevent wrong, inappropriate, undesirable, or in general sub-optimal behaviour of their LLM apps simply by adding trigger thresholds. Monitor the performance in real-time, in production.
Create custom metrics for attributes ranging from 'mention of politics' to 'adherence to our communications policy document v3.1'.
The dashboard provides a comprehensive overview of the performance of your specific LLM applications:
Root Signals provides 30+ built-in, ready-to-use evaluators called Root Evaluators.
Utilizing any LLM as a judge, you can create, benchmark, and tune Custom Evaluators.
We provide complete observability to your LLM applications through our Monitoring view.
Root Signals is available via
🖥️ Web UI
📑 REST API
🔌 Model Context Protocol (MCP) Server (for Agents)
Root Signals can be used by individuals and organizations. Role-based Access Controls (RBAC), SLA, and security definitions are available for organizations. Enterprise customers also enjoy SSO signups via Okta and SAML.
Create a free Root Signals account and get started in 30 seconds.
Models are the actual source of the intelligence. A model generally refers to the type of model (such as GPT), the provider of the model (such as Azure), and the specific variant (such as GPT-4o). The models available on the Root Signals platform consist of:
Proprietary and hosted open-source models accessible via API. These models can be accessed via your API key or the Root Signals platform key (with corresponding billing responsibilities).
Open-source models provided by Root Signals.
Models added by your organization. See the models page for model details.
Some model providers are GDPR-compliant, ensuring data processing meets the General Data Protection Regulation requirements. However, please note that GDPR compliance by the provider does not necessarily mean that data is processed within the EU.
The organization admin can control the API keys and restrict access to a specific subset of models.
Datasets in Root Signals contain static information that can be included as context for skill execution. They can contain information about your organization, products, customers, etc. Datasets are linked to skills using reference variables.
Access to datasets is controlled through permissions. By default, when a user uploads a new dataset, it is set to 'unlisted' status. Unlisted datasets are only visible to the user who created them and to administrators in the organization. This allows users to work on datasets privately until they are ready to be shared with others.
To make a dataset available to other users in the organization, the dataset owner or an administrator needs to change the status to 'listed'. Listed datasets are visible to all users in the organization and can be used in skills by anyone.
Note that dataset permissions control whether a dataset can be used in skill creation or skill editing as a reference variable or as a test dataset. Unless more specific permissions information is made available via enterprise integrations, dataset permissions do not control who can use the dataset in skill execution. That is, once a dataset is fixed to a skill as a reference variable, anyone who has privileges to execute the skill will also have implicit access to the dataset through the skill execution.
It is important for dataset owners and administrators to carefully consider the sensitivity and relevance of datasets before making them widely available. Datasets may contain confidential or proprietary information that should only be accessible to authorized users.
Contact Root Signals for more fine-grained controls in enterprise, regulated or governmental contexts.
The dataset permission system in Root Signals allows for granular control over who can access and use specific datasets. The unlisted/listed status toggle and the special privileges granted to administrators provide flexibility in managing data assets across the organization. Proper management of dataset permissions is crucial for ensuring data security and relevance in skill development and execution.
The requests to any models wrapped within skill objects, and their responses, are traceable within the log objects of the Root Signals platform.
The retention of logs is determined by your platform license. You may export logs at any point for your local storage. Access to execution logs is restricted based on your user role and skill-specific access permissions.
Objectives, evaluators, skills and test datasets are strictly versioned. The version history allows keeping track of all local changes that could affect the execution.
To understand reproducibility of pipelines of generative models, these general principles hold:
For any models, we can control for the exact inputs to the model, record the responses received, and the evaluator results of each run.
For open source models, we can pinpoint the exact version of the model (weights) being used, if this is guaranteed by the model provider, or if the provider is Root Signals.
For proprietary models whose weights are not available, we can pinpoint the version based on the version information given by the providers (such as gpt-4-turbo-2024-04-09) but we cannot guarantee those models are, in reality, fully immutable
Any LLM request with a 'temperature' parameter above 0 should be assumed non-deterministic. Temperature = 0 and/or a fixed value of a 'seed' parameter usually make the result deterministic, but your mileage may vary.
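As a hedged illustration, the sketch below (using the OpenAI Python client; the model name and prompt are placeholders) pins both temperature and seed. Even then, providers treat the seed as a best-effort hint, so identical outputs are not guaranteed.
# Minimal sketch: requesting (near-)deterministic output from a chat model.
# Assumes OPENAI_API_KEY is set; the model name and prompt are illustrative.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,  # remove sampling randomness
    seed=42,        # best-effort reproducibility hint, not a hard guarantee
    messages=[{"role": "user", "content": "Summarize the Q1 revenue drivers."}],
)
print(response.choices[0].message.content)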
Access to all Root Signals entities is controlled via user roles as well as entity-specific settings by the entity creator. The fundamental roles are the User and the Administrator.
Administrators have additional privileges for managing datasets, objectives, skills, and evaluators:
Administrators can see all the entities in the organization, including unlisted ones. This allows them to have an overview of all the data and functional assets in the organization.
Administrators can change the status of any entity such as a dataset from unlisted to listed and vice versa. This enables them to control which entities are shared with the wider organization.
Administrators can delete any entity, regardless of who created it. This is useful for managing obsolete or irrelevant entities.
Administrators also control the accessibility of models across the organization, as well as users and billing.
Root Signals provides evaluators for RAG use cases, where you can give the context as part of the evaluated content.
One such evaluator is the Truthfulness evaluator, which measures the factual consistency of the generated answer against the given context and general knowledge.
Here is an example of running the Truthfulness evaluator using the Python SDK. Pass the context used to get the LLM response in the contexts parameter.
from root import RootSignals
# Connect to the Root Signals API
client = RootSignals()
result = client.evaluators.Truthfulness(
    request="What was the revenue in Q1/2023?",
    response="The revenue in the last quarter was 5.2 M USD",
    contexts=[
        "Financial statement of 2023",
        "2023 revenue and expenses...",
    ],
)
print(result.score)
# 0.5
Both our Skills and Evaluators may be used as custom-generator LLMs in 3rd party frameworks, and we are committed to supporting an OpenAI ChatResponse-compatible API.
Note, however, that additional functionality, such as validation results and calibration, is not available as part of OpenAI responses and requires the user to implement additional code if anything besides failing on unsuccessful validation is required.
Advanced use cases can rely on referencing the completion.id returned by our API as a unique identifier for downstream tasks. Please refer to the Cookbook section for details.
We adhere to Semantic Versioning (SemVer) principles to manage the versions of our software products effectively. This ensures clarity and predictability in how updates and changes are handled.
Communication of Breaking Changes
Notification: All breaking changes are communicated to stakeholders via email. These notifications provide details about the nature of the change, the reasons behind it, and guidance on how to adapt to these changes.
Versioning: When a breaking change is introduced, the major version number of the software is incremented. For example, an upgrade from version 1.4.5 to 2.0.0 indicates the introduction of changes that may disrupt existing workflows or dependencies.
Documentation: Each major release accompanied by breaking changes includes updated documentation that highlights these changes and provides comprehensive migration instructions to assist in transitioning smoothly.
In Root Signals, evaluation is treated as a procedure to compute a metric grounded on a human-defined criteria, emphasizing the separation of utility grounding (Objective) and implementation (Evaluator function).
This lets the criteria and implementations for the evaluations evolve in two separate controlled and trackable tracks, each with different version control logic.
Metric evaluators are different from other entities in the world, and simply treating them as "grounded in data", on one hand, or as "tests", on the other, misses some of their core properties.
In Root Signals, an Objective consists of:
Intent that is human-defined and human-understandable, corresponding to the precise attribute being measured.
Calibration data set that defines, via examples, the structure and scale of those criteria.
An Evaluator function consists of:
Predicate that uniquely specifies the task to the LLMs that power the evaluator
LLM
In-context examples (demonstrations)
[Optionally] Associated data files
An Evaluator function is typically associated with an Objective that connects it to business / contextual value, but the two have no causal connection.
Root Signals platform itself handles:
Semantic quantization: Guaranteeing the predicates are consistently mapped to metrics (for supported LLMs). This lets us abstract the predicates out of the boilerplate prompts needed to yield robust metrics
Version control of evaluator implementations
Maintenance of relationships between Objectives and Evaluators
Monitoring
For example, if an Objective is changed (e.g., its calibration dataset is altered), it is not a priori clear whether the related criteria have changed; such a change affects all evaluator variants using the Objective and may render measurements backwards-incompatible. Hence, the best practice enforced by the Root Signals platform is to create an entirely new Objective, so that it is clear the criteria have changed. This can be bypassed, however, when the Objective is still in its formation stage and/or you accept that the criteria will change over time.
Over time, improved evaluator functions will be created (including but not limited to model updates) to improve upon the Objective targets. On the other hand, Objectives tend to branch and become more precise over time, passing the burden of resolving the question of "is this still the same Objective" to the users, while providing the software support to make those calls either way in an auditable and controllable manner.
Scorable is the automated LLM Evaluation Engineer agent for co-managing Root Signals platform with you.
To get started, sign up and log in to the Root Signals app. Select an evaluator under the Evaluators tab and Execute. You will get a score between 0 and 1 and the justification for the score.
Create your Root Signals API key under Settings > Developer.
pip install root-signals
Root Signals provides over 30 evaluators or judges, which you can use to score any text based on a wealth of metrics. You can attach evaluators to an existing application with just a few lines of code.
from root import RootSignals
# Just a quick test?
# You can get a temporary API key from https://app.rootsignals.ai/demo-user
client = RootSignals(api_key="my-developer-key")
client.evaluators.Politeness(
response="You can find the instructions from our Careers page."
)
# {score=0.7, justification='The response is st...', execution_log_id=...}
npm install @root-signals/typescript-sdk
# or
yarn add @root-signals/typescript-sdk
# or
pnpm add @root-signals/typescript-sdk
and execute:
import { RootSignals } from '@root-signals/typescript-sdk';
// Connect to Root Signals API
const client = new RootSignals({
apiKey: process.env.ROOTSIGNALS_API_KEY!
});
// Run any of our ready-made evaluators
const result = await client.evaluators.executeByName('Helpfulness', {
response: "You can find the instructions from our Careers page."
});
You can execute evaluators in your favourite framework and tech stack via our REST API:
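For instance, here is a minimal sketch using Python's requests library against the evaluator execution endpoint shown elsewhere in these docs; the evaluator ID, API key, and payload are placeholders.
# Minimal REST sketch; evaluator ID and API key are placeholders.
import requests

url = "https://api.app.rootsignals.ai/v1/skills/evaluator/execute/<EVALUATOR_ID>/"
headers = {
    "Authorization": "Api-Key <YOUR API KEY>",
    "Content-Type": "application/json",
}
payload = {"response": "You can find the instructions from our Careers page."}

result = requests.post(url, headers=headers, json=payload, timeout=30).json()
print(result["score"])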
Our Model Context Protocol (MCP) equips your AI agents with evaluation capabilities.
Root Signals design philosophy starts from the principle of extreme semantic rigor. Briefly, this means making sure that, for example
The definitions and references of entities are tracked with maximal (and increasing) precision
Entities are assumed long-term and upgradeable
Entities are built for re-use
Changes will be auditable
Objective defines what you intend to achieve. It grounds an AI automation to the business target, such as providing a feature ('transform data source X into usable format Y') or a value ('suitability for use in Z').
Evaluator is a function that assigns a numeric value to a piece of content such as text, along a semantically defined dimension (truthfulness, relevance of an answer, coherence, etc.).
Judge is a stack of evaluators with a high-level Intent.
Model is the AI model such as an LLM that provides the semantic processing of the inputs. Notably, the list contains both API-based models such as OpenAI and Anthropic models, and open source models such as Llama and Mistral models. Finally, you can add your own locally running models to the list with ease. The organization Admin controls the availability of models enabled in your organization.
To ensure the reliability of the Direct Language evaluator, you can create and use test data, referred to as a calibration dataset. A calibration set is a collection of LLM outputs, prompts, and expected scores that serve as benchmarks for evaluator performance.
Start by attaching an empty calibration set to the evaluator:
Navigate to the Direct Language evaluator page and click Edit.
Select the Calibration section and click Add Dataset.
Name the dataset (e.g., “Direct Language Calibration Set”).
Optionally, add sample rows, such as:
"0,2","I am pretty sure that is what we need to do"
Click Save and close the dataset editor.
Optionally, click the Calibrate button to run the calibration set.
Save the evaluator
You can enhance your calibration set using real-world data from evaluator runs stored in the execution log.
Go to the Execution Logs page.
Locate a relevant evaluator run and click on it.
Click Add to Calibration Dataset to include its output and score in the calibration set.
By regularly updating and running the calibration set, you safeguard the evaluator against unexpected behavior, ensuring its continued accuracy and reliability.
The list of well-calibrated Root evaluators includes:
Relevance
Safety for Children
Sentiment Recognition
Coherence
Conciseness
Engagingness
Originality
Clarity
Precision
Persuasiveness
Confidentiality
Harmlessness
Formality
Politeness
Helpfulness
Non-toxicity
Faithfulness RAG Evaluator
Faithfulness-swift RAG Evaluator
Answer Relevance
Truthfulness RAG Evaluator
Truthfulness-swift RAG Evaluator
Quality of Writing - Professional
Quality of Writing - Creative
JSON Content Accuracy RAG Evaluator | Function Call Evaluator
JSON Property Completeness Function Call Evaluator
JSON Property Type Accuracy Function Call Evaluator
JSON Property Name Accuracy Function Call Evaluator
JSON Empty Values Ratio Function Call Evaluator
Answer Semantic Similarity Ground Truth Evaluator
Answer Correctness Ground Truth Evaluator
Context Recall RAG Evaluator | Ground Truth Evaluator
Context Precision RAG Evaluator | Ground Truth Evaluator
Summarization Quality
Translation Quality
Information Density
Reading Ease
Planning Efficiency
Answer Willingness
Details of each evaluator can be found below.
Judges are stacks of evaluators with their own high-level intent.
You can see the overview of your Judges in the app:
You can inspect a Judge in detail as well:
Root Signals provides evaluators that fit most needs, but you can add custom evaluators for specific needs. In this guide, we will add a custom evaluator and tune its performance using demonstrations.
Consider a use case where you need to evaluate a text based on its number of weasel words or ambiguous phrases. Root Signals provides the optimized Precision evaluator for this, but let's build something similar to go through the evaluator-building process.
Navigate to the Evaluator Page:
Go to the evaluator page and click on "New Evaluator."
Name Your Evaluator:
Type the name for the evaluator, for example, "Direct language."
Define the Intent:
Give the evaluator an intent, such as "Ensures the text does not contain weasel words."
Create the Prompt:
"Is the following text clear and has no weasel words"
Add a placeholder (variable) for the text to evaluate:
Click on the "Add Variable" button to add a placeholder for the text to evaluate.
E.g., "Is the following text clear and has no weasel words: {{response}}"
Select the Model:
Choose the model, such as gpt-4-turbo, for this evaluation.
Save and Test the Evaluator:
Click Create evaluator and test it.
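The same evaluator can also be created programmatically. The sketch below assumes the Python SDK exposes an evaluators.create method with these parameter names; check the SDK reference for the exact signature.
# Hedged sketch: creating the evaluator via the Python SDK.
# The evaluators.create parameter names are assumptions; verify against the SDK reference.
from root import RootSignals

client = RootSignals()

evaluator = client.evaluators.create(
    name="Direct language",
    intent="Ensures the text does not contain weasel words.",
    predicate="Is the following text clear and has no weasel words: {{response}}",
    model="gpt-4-turbo",
)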
You can add demonstrations to the evaluator to tune its scores to match more closely to the desired behavior.
Let's penalize using the word "probably"
Go to the Direct language evaluator and click Edit
Click Add under Demonstrations section
Add a demonstration
Type to the Response field: "This solution will probably work for most users."
Score: 0.1
Save the evaluator and try it out
Note that adding more demonstrations, such as
"The project will probably be completed on time."
"We probably won't need to make any major changes."
"He probably knows the answer to your question."
"There will probably be a meeting tomorrow."
"It will probably rain later today."
will further adjust the evaluator's behavior. Refer to the full evaluator documentation for more information.
Building production-ready and reliable AI applications requires safeguards provided by an evaluation layer. LLM responses can vary drastically based on even the slightest input changes.
Root Signals provides a robust set of fundamental evaluators suitable for any LLM-based application.
You need a few examples of LLM outputs (text). Those can be from any source, such as a summarization output on a given topic.
The Evaluators page shows all evaluators at your disposal. Root Signals provides the base evaluators, but you can also build custom evaluators for specific needs.
Let's start with the Precision evaluator. Based on the text you want to evaluate, feel free to try other evaluators as well.
Click on the Precision evaluator and then click on the Execute skill button.
Paste the text you want to evaluate into the output field and click Execute. You will get a numeric score based on the metric the evaluator is evaluating and the text to evaluate.
An individual score is not very interesting. The power of evaluation lies in integrating evaluators into an LLM application.
Integrating the evaluators as part of your LLM application is a more systematic approach to evaluating LLM outputs. That way, you can compare the scores over time and take action based on the evaluation results.
The Precision evaluator details page contains information on how to add it to your application. First, you must fetch a Root Signals API key and then execute the example cURL command.
Go to the Precision evaluator details page
Click on the Add to your application link
Copy the cURL command
You can omit the request field from the data payload and add the text to evaluate in the response field.
Example (cURL)
A Root Signals subscription provides a selection of models you can use in any of your skills. You are not limited by that selection, though. Integrating with cloud providers' models or connecting to locally hosted models is possible via the SDK or the REST API.
To use an external model, add the model endpoint via the SDK or through the REST API:
After adding the model, you can use it like any other model in your skills and evaluators.
curl --request POST \
--url https://api.app.rootsignals.ai/v1/models/ \
--header 'Authorization: Api-Key $ROOT_SIGNALS_API_KEY' \
--header 'accept: application/json' \
--header 'content-type: application/json' \
--data '
{
"name": "huggingface/meta-llama/Meta-Llama-3-8B",
"url": "https://my-endpoint.huggingface.cloud",
"default_key": "$HF_KEY"
}
'
skill = client.skills.create(
name="My model test", prompt="Hello, my model!", model="huggingface/meta-llama/Meta-Llama-3-8B"
)
# pip install openai
from openai import OpenAI
client = OpenAI(
api_key="$MY_API_KEY",
base_url="https://api.app.rootsignals.ai/v1/judges/$MY_JUDGE_ID/openai/"
)
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "user", "content": "I want to return my product"}
]
)
print(f"Assistant's response: {response.choices[0].message.content}")
print(f"Judge evaluation results: {response.model_extra.get('evaluator_results')}")
curl 'https://api.app.rootsignals.ai/v1/judges/$MY_JUDGE_ID/execute/' \
-H 'authorization: Api-Key $MY_API_KEY' \
-H 'content-type: application/json' \
--data-raw '{"response":"LLM said: You can return the item within 30 days of purchase, and we will refund the full amount...","request":"I want to return my product"}'
# pip install root-signals
from root import RootSignals
client = RootSignals(api_key="$MY_API_KEY")
result = client.judges.run(
judge_id="$MY_JUDGE_ID",
response="LLM said: You can return the item within 30 days of purchase, and we will refund the full amount...",
request="I want to return my product"
)
print(f"Run results: {result.evaluator_results}")
# Score (a float between 0 and 1): {result.evaluator_results[0].score}
# Justification for the score: {result.evaluator_results[0].justification}
curl 'https://api.app.rootsignals.ai/v1/skills/evaluator/execute/767bdd49-5f8c-48ca-8324-dfd6be7f8a79/' \
-H 'authorization: Api-Key <YOUR API KEY>' \
-H 'content-type: application/json' \
--data-raw '{"response":"While large language models (LLMs) have many powerful applications, there are scenarios where they are not as effective or suitable. Here are some use cases where LLMs may not be useful:\n\nReal-Time Critical Systems:\nLLMs are not ideal for applications requiring real-time, critical decision-making, such as air traffic control, medical emergency systems, or autonomous vehicle navigation, where delays or errors can have severe consequences.\n\nHighly Specialized Expert Tasks:\nTasks that require deep domain-specific expertise, such as advanced scientific research, complex legal analysis, or detailed medical diagnosis, may be beyond the capabilities of LLMs due to the need for precise, highly specialized knowledge and judgment."}'
# pip install root-signals
from root import RootSignals
client = RootSignals(api_key="<YOUR API KEY>")
client.evaluators.Precision(
response="While large language models (LLMs) have many powerful applications, there are scenarios where they are not as effective or suitable. Here are some use cases where LLMs may not be useful:\n\nReal-Time Critical Systems:\nLLMs are not ideal for applications requiring real-time, critical decision-making, such as air traffic control, medical emergency systems, or autonomous vehicle navigation, where delays or errors can have severe consequences.\n\nHighly Specialized Expert Tasks:\nTasks that require deep domain-specific expertise, such as advanced scientific research, complex legal analysis, or detailed medical diagnosis, may be beyond the capabilities of LLMs due to the need for precise, highly specialized knowledge and judgment."
)
Example requires langfuse v3.0.0
import os
from langfuse import observe, get_client
from openai import OpenAI
from root import RootSignals
# Initialize Langfuse client using environment variables
# LANGFUSE_SECRET_KEY, LANGFUSE_PUBLIC_KEY, LANGFUSE_HOST
langfuse = get_client()
# Initialize RootSignals client
rs = RootSignals()
# OpenAI client used for the generation step
client = OpenAI()
# Illustrative prompt template for the generation step (not part of the original snippet)
prompt_template = "Explain the following concept in simple terms: {question}"
@observe(name="explain_concept_generation")  # Name for traces in Langfuse UI
def explain_concept(topic: str) -> tuple[str | None, str | None]: # Returns content and trace_id
# Get the trace_id for the current operation, created by @observe
current_trace_id = langfuse.get_current_trace_id()
prompt = prompt_template.format(question=topic)
response_obj = client.chat.completions.create(
messages=[{"role": "user", "content": prompt}],
model="gpt-4",
)
content = response_obj.choices[0].message.content
return content, current_trace_id
def evaluate_concept(request: str, response: str, trace_id: str) -> None:
# Invoke a specific Root Signals judge
result = rs.judges.run(
judge_id="4d369224-dcfa-45e9-939d-075fa1dad99e",
request=request, # The input/prompt provided to the LLM
response=response, # The LLM's output to be evaluated
)
# Iterate through evaluation results and log them as Langfuse scores
for eval_result in result.evaluator_results:
langfuse.create_score(
trace_id=trace_id, # Links score to the specific Langfuse trace
name=eval_result.evaluator_name, # Name of the Root Signals evaluator (e.g., "Truthfulness")
value=eval_result.score, # Numerical score from the evaluator
comment=eval_result.justification, # Explanation for the score
)
Table: Mapping Root Signals Output to Langfuse Score Parameters

| Root Signals | Langfuse | Description in Langfuse Context |
| --- | --- | --- |
| evaluator_name | name | The name of the evaluation criterion (e.g., "Hallucination," "Conciseness"). Used for identifying and filtering scores. |
| score | value | The numerical score assigned by the Root Signals evaluator. |
| justification | comment | The textual explanation from Root Signals for the score, providing qualitative insight into the evaluation. |
Done. Now you can explore detailed traces and metrics in the Langfuse dashboard.
The Root Signals platform is built upon foundational principles that ensure semantic rigor, measurement accuracy, and operational flexibility. These principles guide the design and implementation of all platform features, from evaluator creation to production deployment.
At the core of Root Signals lies a fundamental distinction between what should be measured and how it is measured:
An Objective defines the precise semantic criteria and measurement scale for evaluation.
An Evaluator represents an implementation that can meet these criteria.
This separation enables:
Multiple evaluator implementations for the same objective
Evolution of measurement techniques without changing business requirements
Clear communication between stakeholders about evaluation goals
Standardized benchmarking across different implementations
In practice, an objective consists of an Intent (describing the purpose and goal) and a Calibrator (the score-annotated dataset providing ground truth examples). The evaluator's function—comprised of prompt, demonstrations, and model—represents just one possible implementation of that objective.
Every measurement instrument requires calibration against known standards. In Root Signals, evaluators undergo rigorous calibration to ensure their scores align with human judgment baselines. This process involves:
Calibration datasets: Ground truth examples with expected scores, including optional justifications that illustrate the rationale for specific scores
Deviation analysis: Quantitative assessment using the Root Mean Square method to calculate total deviance between predicted and actual values (formalized below)
Continuous refinement: Iterative improvement based on calibration results, focusing on samples with highest deviation
Version control: Tracking evaluator performance across iterations
Production feedback loops: Adding real execution samples to calibration sets for ongoing improvement
The calibration principle acknowledges that LLM-based evaluators are probabilistic instruments requiring empirical validation. Calibration samples must be strictly separated from demonstration samples to ensure unbiased measurement.
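As a sketch, with predicted scores $s_i$, expected calibration scores $\hat{s}_i$, and $N$ calibration samples, the total deviance described above takes the usual root-mean-square form:
\[ \text{total deviance} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(s_i - \hat{s}_i\right)^2} \]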
All evaluations in Root Signals are fundamentally metric evaluations, producing normalized scores between 0 and 1. This universal approach provides:
Generalizability: Any evaluation concept can be expressed as a continuous metric
Optimization capability: Numeric scores enable gradient-based optimization
Fuzzy semantics handling: Real-world concepts exist on spectrums rather than binary states
Composability: Metrics can be combined, weighted, and aggregated
This principle recognizes that language and meaning are inherently fuzzy, requiring nuanced measurement approaches. Every evaluator maps text to a numeric value, enabling consistent measurement across diverse dimensions like coherence (logical consistency), conciseness (brevity without information loss), or harmlessness (absence of harmful content).
The platform maintains strict independence from specific model implementations, both for operational models (those being evaluated) and judge models (those performing evaluation). This enables:
Model comparison: Evaluate multiple models using identical criteria
Performance optimization: Select models based on accuracy, cost, and latency trade-offs
Future-proofing: Integrate new models as they become available
Vendor independence: Avoid lock-in to specific model providers
Changes in either operational or judge models can be measured precisely, enabling data-driven model selection. The platform supports API-based models (OpenAI, Anthropic), open-source models (Llama, Mistral), and custom locally-running models. Organization administrators control model availability, ensuring governance while maintaining flexibility.
Evaluation definitions must transcend platform boundaries through standardized, interchangeable formats. This principle ensures:
Clear entity references: Distinguish between evaluator references and definitions
Objective portability: Move evaluation criteria between systems
Implementation flexibility: Express objectives independent of specific implementations
Semantic preservation: Maintain meaning across different contexts
The distinction between referencing an entity and describing it enables robust system integration.
Complex evaluation predicates can be expressed either as a single (inherently composite) evaluator or decomposed into a vector of multiple independent evaluators that effectively indicate a dimension of measurement. This principle provides:
Granular calibration: Each dimension can be independently calibrated
Modular development: Evaluators can be developed and tested separately
Precise diagnostics: Identify which specific dimensions need improvement
Flexible composition: Combine dimensions based on use case requirements
For example, "helpfulness" might decompose into truthfulness, relevance, completeness, and clarity—each with its own evaluator and calibration set. This decomposition extends to specialized domains: RAG evaluators (faithfulness, context recall), structured output evaluators (JSON accuracy, property completeness), and task-specific evaluators (summarization quality, translation accuracy), etc. Judges represent practical implementations of this principle, stacking multiple evaluators to achieve comprehensive assessment.
Similar to evaluation objectives, an operational task should have an objective that defines its success criteria independent of implementation. An operational objective consists of:
Intent: The business purpose of the operation.
Success criteria: The set of evaluators that together define acceptable outcomes and what good looks like.
This set of evaluators can be captured in a judge, while the intent is captured in the judge's intent description.
Implementation independence: Multiple ways to achieve the objective
This principle extends the objective/implementation separation to operational workflows, enabling outcome-based task definition rather than prescriptive implementation.
The Root Evaluators are designed as a set of primitive, orthogonal measurement dimensions that minimize overlap while maximizing coverage. This principle ensures:
Minimal redundancy: Each evaluator measures a distinct semantic dimension
Maximal composability: Evaluators combine cleanly without interference
Complete coverage: The primitive set spans the space of common evaluation needs
Predictable composition: Combining evaluators yields intuitive results
This orthogonality enables judges to be constructed as precise combinations of primitive evaluators. For instance, "professional communication quality" might combine:
Clarity (information structure)
Formality (tone appropriateness)
Precision (technical accuracy)
Grammar correctness (linguistic quality)
Each dimension contributes independently, allowing fine-grained control over the composite evaluation. The orthogonal design prevents double-counting of features and ensures that improving one dimension doesn't inadvertently degrade another. In cases where one can arguably interpret the evaluator in several different ways, we split these into separate objectives and corresponding Root Evaluators, such as in the case of relevance which may or may not be interpreted to include truthfulness (for instance, in a factual context, an untrue statement is arguably irrelevant, whereas in a story or hypothetical context, this may not be the case).
These principles manifest throughout the Root Signals platform:
Evaluator creation starts with objective definition before implementation
Calibration workflows ensure measurement reliability
Judge composition allows stacking evaluators for complex assessments
Version control tracks both objectives and implementations
API design separates concerns between what and how
By adhering to these principles, Root Signals provides a semantically rigorous foundation for AI evaluation that scales from simple metrics to complex operational workflows.
Integrate Root Signals evaluations with Google Cloud's Vertex AI Agent Builder to monitor and improve your conversational AI agents in real-time.
[Vertex AI Agent Builder]
|
|—→ [Webhook call (to Cloud Function / Cloud Run)]
|
|—→ [Root Signals API]
|
|—→ [Evaluate response]
|
[Log result / augment reply]
|
←——————— Reply to Agent Builder user
Go to "Manage Fulfillment" in the Agent Builder UI.
Create a webhook (can be a Cloud Function, Cloud Run, or any HTTP endpoint).
This webhook will receive request and response pairs from user interactions.
This endpoint will:
Receive user input and the LLM response.
Construct an evaluator call to Root Signals API.
Send the result back as part of the webhook response (optional).
Option 1: Using Built-in Evaluators
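// Note: these handlers assume an Express app with JSON body parsing is already set up, e.g.:
// const express = require('express');
// const app = express();
// app.use(express.json());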
app.post('/evaluate', async (req, res) => {
const userInput = req.body.sessionInfo.parameters.input;
const modelResponse = req.body.fulfillmentResponse.messages[0].text.text[0];
// Use a built-in evaluator (e.g., Relevance)
const evaluatorPayload = {
request: userInput,
response: modelResponse,
};
const evaluatorResult = await fetch('https://api.app.rootsignals.ai/v1/skills/evaluator/execute/YOUR_EVALUATOR_ID/', {
method: 'POST',
headers: {
'Authorization': 'Api-Key YOUR_API_KEY',
'Content-Type': 'application/json',
},
body: JSON.stringify(evaluatorPayload),
});
const result = await evaluatorResult.json();
console.log('Evaluator Score:', result.score);
// Return modified response (if needed)
res.json({
fulfillment_response: {
messages: [
{
text: {
text: [
`${modelResponse} (Quality score: ${result.score.toFixed(2)})`
]
}
}
]
}
});
});
Option 2: Using Custom Judges
app.post('/evaluate', async (req, res) => {
const userInput = req.body.sessionInfo.parameters.input;
const modelResponse = req.body.fulfillmentResponse.messages[0].text.text[0];
// Use a custom judge
const judgePayload = {
request: userInput,
response: modelResponse,
};
const judgeResult = await fetch('https://api.app.rootsignals.ai/v1/judges/YOUR_JUDGE_ID/execute/', {
method: 'POST',
headers: {
'Authorization': 'Api-Key YOUR_API_KEY',
'Content-Type': 'application/json',
},
body: JSON.stringify(judgePayload),
});
const result = await judgeResult.json();
console.log('Judge Score:', result.evaluator_results);
// Return modified response (if needed)
res.json({
fulfillment_response: {
messages: [
{
text: {
text: [
`${modelResponse} (Judge results: ${JSON.stringify(result.evaluator_results)})`
]
}
}
]
}
});
});
Built-in Evaluators:
Use evaluators like Relevance, Precision, Completeness, Clarity, etc.
Get available evaluators by logging in to https://app.rootsignals.ai/
Examples: Relevance, Truthfulness, Safety, Professional Writing
Custom Judges:
Create custom judges that combine multiple evaluators - use https://scorable.rootsignals.ai/ to generate a judge.
Judges provide aggregated scoring across multiple criteria
Root Signals enables several key workflows that transform how organizations measure, optimize, and control their AI applications. These flows represent common patterns for leveraging the platform's capabilities to achieve concrete outcomes.
In this flow, we transform a description of the workflow or measurement problem into a judge, consisting of a concrete set of evaluators that precisely measure success. The process involves:
Success Criteria Definition: Start with your business problem or use case description, and identify which dimensions of success matter for your specific context
Evaluator Selection: Map success criteria to specific evaluators from the Root Signals portfolio or create custom ones
Evaluator Construction: Create custom evaluators for key measurement targets
Judge Assembly: Combine selected evaluators into a coherent measurement strategy
Example: For a customer service chatbot, the problem "a chatbot for which we must ensure helpful and accurate responses" might decompose into:
Relevance evaluator (responses address the customer's question)
Completeness evaluator (all aspects of queries are addressed)
Politeness evaluator (maintaining professional tone)
Policy adherence evaluator (following company guidelines)
Evaluator-Driven Improvement of Prompts and Models for Operational Prompts
Given a set of evaluators, this flow systematically improves your AI application's performance:
Baseline Measurement: Evaluate current prompts and models against the evaluators
Variation Testing: Test different prompts, models, and configurations
Optimal Performance Selection: Choose the configuration that best balances evaluator scores against costs and latencies (see the sketch after the considerations below)
Key considerations:
Balance accuracy improvements against cost increases
Consider latency requirements for real-time applications
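A minimal sketch of baseline measurement and variation testing, assuming the Conciseness Root evaluator; the candidate prompts, ticket text, and generation model are illustrative placeholders:
# Sketch: score two candidate prompts with one evaluator and keep the better one.
from openai import OpenAI
from root import RootSignals

llm = OpenAI()
rs = RootSignals()

candidate_prompts = [
    "Summarize the following support ticket in two sentences: {ticket}",
    "Write a brief, plain-language summary of this support ticket: {ticket}",
]
ticket = "Customer reports that exported CSV files are missing the header row."

scores = []
for prompt in candidate_prompts:
    completion = llm.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt.format(ticket=ticket)}],
    )
    output = completion.choices[0].message.content
    scores.append(rs.evaluators.Conciseness(response=output).score)

best_prompt = candidate_prompts[scores.index(max(scores))]
print(f"Best prompt by Conciseness: {best_prompt!r} (scores: {scores})")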
Calibration Data-Driven Improvement of Predicates and Models for Evaluators
Given a calibration dataset, this flow systematically improves the performance of individual evaluators:
Baseline Measurement: Evaluate the current predicate and model against the calibration dataset
Variation Testing: Test different predicates, models, and configurations
Optimal Performance Selection: Choose the configuration that best balances calibration scores against costs and latencies
Key considerations:
Balance accuracy improvements against cost increases.
Consider latency requirements for real-time applications; note that some workflows (email, offline agent operations) are not latency-sensitive
Transform Existing Data into Actionable Insights
This flow applies evaluators to existing datasets or LLM input-output telemetry, enabling data quality assessment and filtering:
Data Ingestion: Load transcripts, chat logs, or other text data
Evaluator Application: Score each data point across the multiple evaluation dimensions
Metadata Enrichment: Attach scores as searchable metadata
Filtering and Analysis: Identify high/low-quality samples, policy violations, or improvement opportunities (a sketch follows the applications below)
Applications:
Call center transcript analysis (clarity, policy alignment, customer satisfaction indicators)
Training data curation (identifying high-quality examples)
Compliance monitoring (detecting policy violations)
Quality assurance sampling (focusing review on problematic cases)
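A minimal sketch of this flow, assuming a list of call-center transcript strings and the Clarity Root evaluator; the transcripts and review threshold are illustrative:
# Sketch: enrich existing transcripts with evaluator scores, then filter.
from root import RootSignals

client = RootSignals()

transcripts = [
    "Agent: Thanks for calling. Customer: I need to update my billing address ...",
    "Agent: Uh, yeah, so, like, maybe try turning it off or something ...",
]

enriched = []
for text in transcripts:
    result = client.evaluators.Clarity(response=text)
    enriched.append({"text": text, "clarity": result.score})

# Focus human review on the weakest samples (threshold is illustrative)
needs_review = [row for row in enriched if row["clarity"] < 0.4]
print(f"{len(needs_review)} of {len(enriched)} transcripts flagged for review")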
This flow creates a feedback loop that automatically improves content based on evaluation results:
Initial Evaluation: Score the original content with relevant evaluators
Feedback Generation: Extract scores and justifications from evaluators
Improvement Execution:
For LLM-generated content: Re-prompt the original model with evaluation feedback
For existing content: Pass to any LLM with improvement instructions based on evaluator feedback
Verification: Re-evaluate to confirm improvements
Use cases:
Iterative response refinement in production
Batch improvement of historical data
Automated content enhancement pipelines
Self-improving AI systems
This flow implements safety and quality controls by preventing substandard LLM outputs from reaching users:
Threshold Definition: Set minimum acceptable scores for critical evaluators
Real-Time Evaluation: Score LLM outputs before delivery
Conditional Blocking: Prevent responses that fall below thresholds from being served (see the sketch after the lists below)
Fallback Handling: Trigger alternative responses or escalation procedures for blocked content
Implementation strategies:
Critical evaluators: Harmlessness, confidentiality, policy adherence
Quality thresholds: Minimum coherence, relevance, or completeness scores
Graceful degradation: Provide safe default responses when blocking occurs
Logging and alerting: Track blocked responses for system improvement
Applications:
Customer-facing chatbots requiring brand safety
Healthcare AI with strict accuracy requirements
Financial services with regulatory compliance needs
Educational tools requiring age-appropriate content
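A minimal sketch of threshold-based blocking, assuming the Harmlessness Root evaluator; the threshold and fallback message are illustrative:
# Sketch: block responses that fall below a safety threshold before serving them.
from root import RootSignals

client = RootSignals()

SAFE_FALLBACK = "I'm sorry, I can't help with that. Please contact our support team."
THRESHOLD = 0.8  # illustrative; tune per evaluator and use case

def guarded_reply(candidate_response: str) -> str:
    result = client.evaluators.Harmlessness(response=candidate_response)
    if result.score < THRESHOLD:
        # Log the blocked response for later system improvement, then degrade gracefully
        print(f"Blocked response (score={result.score}, log={result.execution_log_id})")
        return SAFE_FALLBACK
    return candidate_response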
Zero-Impact Monitoring of LLM Traffic
This flow enables comprehensive observability without affecting application performance:
Proxy Configuration: Route LLM traffic through Root Signals proxy
Automatic Capture: All requests and responses logged transparently
Asynchronous Processing of Evaluations: Evaluations occur out-of-band
Dashboard Visibility: Real-time metrics
Benefits:
No code changes required in application, only base_url update
Automatic request/response pairing
Built-in retry and error handling
Centralized configuration management
Alternatively, log directly from your application:
Asynchronous Logging: Send request/response pairs to the Root Signals API
Non-Blocking Implementation: Use a fire-and-forget pattern or background queues (see the sketch after the considerations below)
Batching Strategy: Aggregate logs for efficient transmission
Resilient Design: Handle logging failures without affecting main flow
Benefits:
Full control over what gets logged
No network topology changes
Custom metadata enrichment
Selective logging based on business logic
Key considerations for both approaches:
Zero latency addition: Logging happens asynchronously
High-volume support: Handles production-scale traffic
Cost optimization: Sample high-volume, low-risk traffic
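A minimal sketch of the non-blocking, fire-and-forget pattern, running the Relevance Root evaluator out-of-band on a background thread so the main response path is never delayed; the worker count is illustrative:
# Sketch: evaluate request/response pairs off the critical path.
from concurrent.futures import ThreadPoolExecutor
from root import RootSignals

client = RootSignals()
executor = ThreadPoolExecutor(max_workers=4)  # illustrative pool size

def _evaluate(request: str, response: str) -> None:
    # Runs on a background thread; failures here must not affect the main flow
    try:
        client.evaluators.Relevance(request=request, response=response)
    except Exception:
        pass  # resilient design: never let evaluation errors propagate

def serve(request: str, response: str) -> str:
    # Fire-and-forget: submit the evaluation and return immediately
    executor.submit(_evaluate, request, response)
    return response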
Agentic RAG with Root Signals Relevance Judge
A replication of the Agentic RAG tutorial from LangGraph, where the decision of whether or not to use the retrieved content to answer a question is powered by Root Signals Evaluators.
The following is from LangGraph docs:
%%capture --no-stderr
%pip install -U --quiet langchain-community tiktoken langchain-openai langchainhub chromadb langchain langgraph langchain-text-splitters
import getpass
import os
def _set_env(key: str):
if key not in os.environ:
os.environ[key] = getpass.getpass(f"{key}:")
_set_env("OPENAI_API_KEY")
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from typing import Annotated, Sequence, Literal
from typing_extensions import TypedDict
from langchain_core.messages import BaseMessage
from langgraph.graph.message import add_messages
from langchain import hub
from langchain_core.messages import BaseMessage, HumanMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field
from langgraph.prebuilt import tools_condition
from langchain.tools.retriever import create_retriever_tool
from langgraph.graph import END, StateGraph, START
from langgraph.prebuilt import ToolNode
import pprint
urls = [
"https://www.rootsignals.ai/post/evalops",
"https://www.rootsignals.ai/post/llm-as-a-judge-vs-human-evaluation",
"https://www.rootsignals.ai/post/root-signals-bulletin-january-2025",
]
docs = [WebBaseLoader(url).load() for url in urls]
docs_list = [item for sublist in docs for item in sublist]
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
chunk_size=100, chunk_overlap=50
)
doc_splits = text_splitter.split_documents(docs_list)
# Add to vectorDB
vectorstore = Chroma.from_documents(
documents=doc_splits,
collection_name="rag-chroma",
embedding=OpenAIEmbeddings(),
)
retriever = vectorstore.as_retriever()
retriever_tool = create_retriever_tool(
retriever,
"retrieve_blog_posts",
"Search and return information about Root Signals blog posts on LLM evaluation.",
)
tools = [retriever_tool]
class AgentState(TypedDict):
# The add_messages function defines how an update should be processed
# Default is to replace. add_messages says "append"
messages: Annotated[Sequence[BaseMessage], add_messages]
### Nodes
def agent(state):
"""
Invokes the agent model to generate a response based on the current state. Given
the question, it will decide to retrieve using the retriever tool, or simply end.
Args:
state (messages): The current state
Returns:
dict: The updated state with the agent response appended to messages
"""
print("---CALL AGENT---")
messages = state["messages"]
model = ChatOpenAI(temperature=0, streaming=True, model="gpt-4-turbo")
model = model.bind_tools(tools)
response = model.invoke(messages)
# We return a list, because this will get added to the existing list
return {"messages": [response]}
def rewrite(state):
"""
Transform the query to produce a better question.
Args:
state (messages): The current state
Returns:
dict: The updated state with re-phrased question
"""
print("---TRANSFORM QUERY---")
messages = state["messages"]
question = messages[0].content
msg = [
HumanMessage(
content=f""" \n
Look at the input and try to reason about the underlying semantic intent / meaning. \n
Here is the initial question:
\n ------- \n
{question}
\n ------- \n
Formulate an improved question: """,
)
]
# Grader
model = ChatOpenAI(temperature=0, model="gpt-4-0125-preview", streaming=True)
response = model.invoke(msg)
return {"messages": [response]}
def generate(state):
"""
Generate answer
Args:
state (messages): The current state
Returns:
dict: The updated state with re-phrased question
"""
print("---GENERATE---")
messages = state["messages"]
question = messages[0].content
last_message = messages[-1]
docs = last_message.content
# Prompt
prompt = hub.pull("rlm/rag-prompt")
# LLM
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0, streaming=True)
# Post-processing
def format_docs(docs):
return "\n\n".join(doc.page_content for doc in docs)
# Chain
rag_chain = prompt | llm | StrOutputParser()
# Run
response = rag_chain.invoke({"context": docs, "question": question})
return {"messages": [response]}
print("*" * 20 + "Prompt[rlm/rag-prompt]" + "*" * 20)
prompt = hub.pull("rlm/rag-prompt").pretty_print() # Show what the prompt looks like
Define the Decision-maker as a Root Judge
Now we define the Root Signals Relevance evaluator as the decision maker for whether the answer should come from the retrieved docs or not. The advantages of using Root Signals (as opposed to the original LangGraph method) are:
We can control the relevance threshold because Root Signals evaluators always return a normalized score between 0 and 1.
If we want, we can incorporate the Justification in the decision-making process.
The code is much shorter, i.e. about ⅓ of that of the LangGraph tutorial.
from root import RootSignals
client = RootSignals()
def grade_relevance(state) -> Literal["generate", "rewrite"]:
"""
Determines whether the retrieved documents are relevant to the question.
Args:
state (messages): The current state
Returns:
str: A decision for whether the documents are relevant or not
"""
messages = state["messages"]
question = messages[0].content
docs = messages[-1].content
result = client.evaluators.Relevance(
request=question,
response=docs,
)
if result.score > 0.5: # we can control the threshold
return "generate"
else:
return "rewrite"
The rest of the tutorial is still from LangGraph:
# Define a new graph
workflow = StateGraph(AgentState)
# Define the nodes we will cycle between
workflow.add_node("agent", agent) # agent
retrieve = ToolNode([retriever_tool])
workflow.add_node("retrieve", retrieve) # retrieval
workflow.add_node("rewrite", rewrite) # Re-writing the question
workflow.add_node(
"generate", generate
) # Generating a response after we know the documents are relevant
# Call agent node to decide to retrieve or not
workflow.add_edge(START, "agent")
# Decide whether to retrieve
workflow.add_conditional_edges(
"agent",
# Assess agent decision
tools_condition,
{
# Translate the condition outputs to nodes in our graph
"tools": "retrieve",
END: END,
},
)
# Edges taken after the `action` node is called.
workflow.add_conditional_edges(
"retrieve",
# Assess agent decision
grade_relevance, # this is Root Signals evaluator
)
workflow.add_edge("generate", END)
workflow.add_edge("rewrite", "agent")
# Compile
graph = workflow.compile()
Our RAG Agent is ready:
inputs = {
"messages": [
("user", "What is EvalOps?"),
]
}
for output in graph.stream(inputs):
for key, value in output.items():
pprint.pprint(f"Output from node '{key}':")
pprint.pprint("---")
pprint.pprint(value, indent=2, width=80, depth=None)
pprint.pprint("\n---\n")
An evaluator is a metric for a piece of text that maps a string originating from a language model to a numeric value between 0 and 1. For example, an evaluator could measure the "Truthfulness" of the generated text.
Root Signals provides a rich collection of evaluators that you can use, such as:
Quality of professional writing: checks how grammatically correct, clear, concise and precise the output is
Completeness: evaluates how well the response addresses all aspects of the input request
Toxicity Detection: Identifies any toxic or inappropriate content
Faithfulness: Verifies the faithfulness of response with respect to a given context, acting as a hallucination detection, e.g. in RAG settings
Sentiment Analysis: Determines the overall sentiment (positive, negative, or neutral)
You can also define your own custom evaluators.
The objective of an evaluator consists of two components:
Intent: This describes the purpose and goal of the evaluator, specifying what it aims to evaluate or assess in the response.
Calibrator: It provides the ground truth set of appropriate numeric values for specific request-response pairs that defines the intended behavior of the evaluator. This set 'calibrates' its evaluation criteria and ensures consistent and accurate assessments.
The function of an evaluator consists of three components:
Prompt
Demonstrations
Model
The prompt (or instruction) defines the instructions and variable content the evaluator prompts a large language model with. It should clearly specify the criteria and guidelines for assessing the quality and performance of responses.
Note: During execution, the prompt defined by the user is appended to a more general template containing instructions responsible for guiding and optimizing the behavior of the evaluator. Thus the user does not have to bother with generic instructions such as "Give a score between 0 and 1". It is sufficient to describe the evaluation criteria of the specific evaluator at hand.
Example: How well does the {{response}} adhere to instructions given in {{request}}.
All variable types are available for an evaluator. However, some restrictions apply.
The prompt of an evaluator must contain a special variable named response that represents the LLM output to be evaluated. It can also contain a special variable named request if the prompt that produced the input is considered relevant for evaluation. Both request and response can be either input or reference variables. In the latter case, the variable is associated with a dataset that can be searched for contextual information to support the evaluation, using Retrieval Augmented Generation.
A demonstration is a sample consisting of a request-response pair (or just a response, if the request is not considered necessary for evaluation), an expected score, and an optional justification. Demonstrations exemplify the expected behavior of the evaluator. Demonstrations are provided to the model and must therefore be strictly separated from calibration samples.
A justification illustrates the rationale for the given score. Justification can be helpful when the reason for a specific score is not obvious, allowing the model to pay attention to relevant aspects of the evaluated response and tackle ambiguous cases in a nuanced way.
Example:
A sample demonstration for an evaluator for determining if content is safe for children.
The model refers to the specific language model or engine used to execute the evaluator. It should be chosen based on its capabilities and suitability for the evaluation task.
Calibration is the response to the naturally arising question: How can we trust evaluation results? The calibrator provides a way to quantify the performance of the evaluator by providing the ground truth against which the evaluator can be gauged. The reference dataset that forms the calibrator defines the expected behaviour of the evaluator.
The samples of the calibration dataset are similar to those of the demonstration dataset, consisting of a score, a response, and optionally a request and a justification.
On the Calibrator page:
The calibration dataset can be imported from a file or typed in the editor.
A synthetic dataset can be generated, edited, and appended.
The Calibration page enables comparing the actual performance of the evaluator on the samples of the calibration dataset with the expected performance defined by the scores of the set.
On this page:
Total deviance, which quantifies the average magnitude of the errors between predicted values by an evaluator and the actual observed values, can be calculated. A lower total deviance indicates that the evaluator's predictions are closer to the actual outcomes, which signifies better performance of the evaluator. The total deviance is computed using the Root Mean Square method.
Deviations for individual samples of the dataset are displayed, enabling easy identification of weak points of the evaluator. If a particular sample has a high deviation, there are characteristics in the sample that confuse the evaluator.
To improve the performance or 'calibrate' an evaluator, adjustments can be made to one or more of the three key components: the prompt, the demonstrations, and the model.
Effective strategies for this can be deduced by examining the calibration results. Inspecting the worst-performing samples, those with the largest deviations, can help identify the evaluator's weak points.
Then, one or more steps can be taken:
The instructions given in the prompt can be made more specific to adjust the behavior in the problem cases.
Modify demonstration content by adding examples similar to the problematic samples, which can enhance performance in these areas. Additional instructions can be added by including a justification to a demonstration. Note: Maintaining a sufficiently large calibration dataset reduces the risk of overfitting, i.e., producing an evaluator tailored to the calibration but lacking generalization.
The model can be changed. Overall performance can be improved by using a larger or otherwise better suited model, often at the cost of evaluation latency and price.
After each modification, it's advisable to recalculate the deviations to assess the direction and magnitude of the impact on performance.
As evaluators are a special type of skill, the concepts that apply to skills apply to evaluator skills too.
Evaluators tagged with RAG Evaluator work properly when evaluating skills with reference variables. Alternatively, when not used to evaluate skill outputs, a contexts parameter containing a set of documents as a list of strings, corresponding to the retrieved context data, must be passed.
Evaluators tagged with Ground Truth Evaluator can be used for evaluating test sets that contain an expected_output column. When used through the SDK, an expected_output parameter must likewise be passed.
Evaluators tagged with Function Call Evaluator can be used through the SDK and require a functions parameter, conforming to the OpenAI-compatible tools parameter, to be passed.
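As a sketch of how these parameters are passed through the Python SDK (the evaluator IDs below are placeholders and the payloads are illustrative; see the SDK reference for details):
from root import RootSignals

client = RootSignals()

# RAG Evaluator: pass the retrieved documents as a list of strings via `contexts`.
client.evaluators.run(
    request="What is our refund policy?",
    response="Refunds are available within 30 days of purchase.",
    contexts=[
        "Refund policy: purchases can be refunded within 30 days.",
        "Shipping policy: orders ship within 2 business days.",
    ],
    evaluator_id="<faithfulness-evaluator-id>",  # placeholder
)

# Ground Truth Evaluator: pass the expected answer via `expected_output`.
client.evaluators.run(
    response="Refunds are available within 30 days of purchase.",
    expected_output="Purchases can be refunded within 30 days.",
    evaluator_id="<answer-correctness-evaluator-id>",  # placeholder
)

# Function Call Evaluator: pass an OpenAI-compatible tools definition via `functions`.
client.evaluators.run(
    request="Check the weather in Helsinki.",
    response='{"name": "get_weather", "arguments": {"city": "Helsinki"}}',
    functions=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    evaluator_id="<json-content-accuracy-evaluator-id>",  # placeholder
)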
Relevance Assesses the relevance of the response in relation to the request by evaluating accuracy, completeness, adherence to the prompt, and logical consistency, to determine whether responses remain directly on-topic and informative.
Safety for Children Checks the appropriateness of content for young audiences, focusing on avoiding language or themes that could be harmful or unsuitable for children, thus promoting safety and age-appropriateness.
Sentiment Recognition Identifies the emotional tone of the response, determining whether it is predominantly positive, negative, or neutral, which helps in gauging the overall sentiment conveyed by the content.
Coherence Assesses whether the response is logically consistent and contextually relevant, checking for smooth transitions between ideas and the maintenance of topic focus throughout the response.
Conciseness Measures the brevity and directness of the response, aiming to convey information efficiently without the use of unnecessary words, thus avoiding redundancy.
Engagingness Evaluates the ability of the response to engage readers, determining whether the content is interesting and likely to hold attention, or if it tends toward being boring or predictable.
Originality Checks the originality and creativity of the response, focusing on providing unique and inventive perspectives rather than relying on clichéd or derivative ideas.
Clarity Measures how easily the response can be understood on a first reading, aiming for ideas to be presented in a logical, straightforward manner with sufficient detail for clarity.
Precision Assesses the accuracy and specificity of the response, aiming for responses that are detailed and exact, with no room for ambiguities or overly general statements.
Completeness Evaluates how well the response addresses all aspects of the input request, ensuring that no important elements are overlooked and that comprehensive coverage is provided for multi-faceted queries or instructions.
Persuasiveness Evaluates the persuasiveness of the response by examining the strength, clarity, and convincingness of the main argument or point, including how well it is supported by evidence, examples, and logical reasoning.
Confidentiality Assesses the response for the presence of any personal details, private communications, sensitive business data, or any other information that should be kept confidential, focusing on maintaining privacy and security.
Harmlessness Assesses the harmlessness of the response by checking for the presence of hate speech, slurs, derogatory language, and any incitement or glorification of violence. It evaluates the overall potential of the content to cause harm or distress.
Formality Evaluates the formality of the response by considering factors such as word choice, sentence structure, tone, grammar, and overall style. This helps in matching the content to the expected level of formality for the context.
Politeness Assesses the politeness of the response by examining factors such as word choice, tone, phrasing, and the overall level of respect and courtesy demonstrated in the response.
Helpfulness Evaluates the helpfulness of the response by considering how useful, informative, and beneficial the text is to a reader seeking information. Helpful text provides clear, accurate, relevant, and comprehensive information to aid the reader's understanding and ability to take appropriate action.
Non-toxicity Assesses the non-toxicity of the response. Text that is benign and completely harmless receives high scores.
Faithfulness RAG Evaluator This corresponds to hallucination detection in RAG settings. Measures the factual consistency of the generated answer with respect to the context. It determines whether the response accurately reflects the information provided in the context. This is the high-accuracy variant of our set of Faithfulness evaluators.
Faithfulness-swift RAG Evaluator This is the faster variant of our set of Faithfulness evaluators.
Answer Relevance Measures how relevant a response is with respect to the prompt/query. Completeness and conciseness of the response are considered.
Truthfulness RAG Evaluator Assesses factual accuracy by prioritizing context-backed claims over model knowledge, while preserving partial validity for logically consistent but unverifiable claims. Unlike Faithfulness, allows for valid model-sourced information beyond the context. This is the high-accuracy variant of our set of Truthfulness evaluators.
Truthfulness-swift RAG Evaluator This is the faster variant of our set of Truthfulness evaluators.
Quality of Writing - Professional Measures the quality of writing as a piece of academic or other professional text. It evaluates the formality, correctness, and appropriateness of the writing style, aiming to match professional standards.
Quality of Writing - Creative Measures the quality of writing as a piece of creative text. It evaluates the creativity, expressiveness, and originality of the content, focusing on its impact and artistic expression.
JSON Content Accuracy RAG Evaluator | Function Call Evaluator Checks if the content of the JSON response is accurate and matches the documents and instructions, verifying that the JSON data correctly represents the intended information.
JSON Property Completeness Function Call Evaluator Checks how many of the required properties are present in the JSON response, verifying that all necessary fields are included. This is a string (non-LLM) evaluator.
JSON Property Type Accuracy Function Call Evaluator Checks if the types of properties in the JSON response match the expected types, verifying that the data types are correct and consistent. This is a string (non-LLM) evaluator.
JSON Property Name Accuracy Function Call Evaluator Checks if the names of properties in the JSON response match the expected names, verifying that the field names are correct and standardized. This is a string (non-LLM) evaluator.
JSON Empty Values Ratio Function Call Evaluator Checks the portion of empty values in the JSON response, aiming to minimize missing information and ensure data completeness. This is a string (non-LLM) evaluator.
Answer Semantic Similarity Ground Truth Evaluator Measures the semantic similarity between the generated answer and the ground truth, helping to evaluate how well the response mirrors the expected response.
Answer Correctness Ground Truth Evaluator Measures the factual correspondence of the generated response against a user-supplied ground truth. It considers both semantic similarity and factual consistency.
Context Recall RAG Evaluator | Ground Truth Evaluator Measures whether the retrieved context provides sufficient information to produce the ground truth response, evaluating if the context is relevant and comprehensive according to the expected output.
Context Precision RAG Evaluator | Ground Truth Evaluator Measures the relevance of the retrieved contexts to the expected output.
Summarization Quality Measures the quality of text summarization with high weights for clarity, conciseness, precision, and completeness.
Translation Quality Quality of machine translation with high weights for accuracy, completeness, fluency, and cultural appropriateness.
Planning Efficiency Quality of planning of an AI agent with high weights for efficiency, effectiveness, and goal-orientation.
Information Density Information density of a response with high weights for concise, factual statements and penalizing vagueness, questions, or evasive answers.
Reading Ease Evaluates the text for ease of reading, focusing on simple language, clear sentence structures, and overall clarity.
Answer Willingness Answer willingness of a response with high weights for response presence, directness and penalty for response avoidance, refusal, or evasion.
As our evaluators are LLM judges, they are non-deterministic, i.e., the same input can result in slightly different scores. We try to keep this fluctuation low. The expected standard deviations of each evaluator are reported below along three dimensions: short vs. long context, single-turn vs. multi-turn, and low vs. high ground truth score:
Both ready-made Root Evaluators and your Custom Evaluators have version control. Normally, you can call an evaluator as:
client.evaluators.run(
    request="My internet is not working.",
    response="""
    I'm sorry to hear that your internet isn't working.
    Let's troubleshoot this step by step. What is your IP address?
    """,
    evaluator_id="bd789257-f458-4e9e-8ce9-fa6e86dc3fb9",  # e.g. corresponding to Relevance
)
and if you want to call a specific version, you can add:
evaluator_version_id="7c099204-4a41-4d56-b162-55aac24f6a47"
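Put together, a version-pinned call could look like the following sketch (combining the two snippets above; check the SDK reference for the exact signature):
client.evaluators.run(
    request="My internet is not working.",
    response="I'm sorry to hear that. What is your IP address?",
    evaluator_id="bd789257-f458-4e9e-8ce9-fa6e86dc3fb9",
    evaluator_version_id="7c099204-4a41-4d56-b162-55aac24f6a47",  # pin a specific evaluator version
)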
Coming in August 2025 - meanwhile, contact [email protected] for the full compliance & security documentation.
Documentation merging in progress - meanwhile, contact [email protected] for the full documentation
Root Signals builds in the open, with a philosophy of transparency and multiple open-source projects. This roadmap is a living document about what we're working on and what's next. Root Signals is the world's most principled and powerful system for measuring the behavior of LLM-based applications, agents, and workflows.
Scorable is the automated LLM Evaluation Engineer agent for co-managing this platform with you.
Our vision is to create and auto-optimize the strongest automated knowledge process evaluation stack possible, with the least amount of effort and information required from the user.
Maximum Automated Information Extraction
From user intent and/or provided example/instruction data, extract as much relevant information as possible.
Awareness of the information quality
Engage the user with the smallest amount of maximally impactful questions.
Maximally Powerful Evaluation Stack Generation
Build the most comprehensive and accurate evaluation capabilities possible, within the confines of data available.
Built for Agents
Maximum compatibility with autonomous agents and workflows.
Maximum Integration Surface
Seamless integration with all key AI frameworks.
EvalOps Principles for Long Term
Follow Root EvalOps Principles for evaluator lifecycle management.
✅ Automated Policy Adherence Judges
Create judges from uploaded policy documents and intents
✅ GDPR awareness of models (link)
Ability to filter out models not complying with GDPR
✅ Evaluator Calibration Data Synthesizer v1.0 (link)
In the evaluator drill-in view, expand your calibration dataset from 1 or more examples
✅ Evaluator version history and control to include all native Root Evaluators (link)
✅ Evaluator determinism benchmarks and standard deviations in reference datasets (link)
✅ Agent Evaluation MCP: stdio & SSE versions (link)
✅ Root Judge LLM 70B judge available for download and running in Root Signals for free!
Public Evaluation Reports
Generate HTML reports from any judge execution
TypeScript SDK
Rehashing of Example-driven Evaluation
Smoothly create the full judge from examples
Native Speech Evaluator API
Upload or stream audio directly to evaluators
Unified Experiments framework to Replace Skill Tests
Command Line Interface
Advanced Judge visibility controls
RBAC coverage on Judges (as in Evaluators, Skills and Datasets)
Output Refinement At-Origin
Refine your LLM outputs automatically based on scores
Agentic Classifier Generation 2.0
Create classifiers with the same robustness as metric evaluator stacks
Automatic Context Engineering
Refine your prompt templates automatically based on scores
Support all RAG evaluators
Improved Playground View
Agent Evaluation Pack 2.0
(Root Evaluator list expanding every 1-2 weeks, stay tuned)
Full OpenTelemetry Support
LiteLLM Direct Support
OpenRouter Support
(more coming)
Sync Judge & Evaluator Definitions to GitHub
Community Evals
Self-Hostable Evaluation Executor
Remote MCP Server
MCP Feature Extension Pack
Full judge feature access
Full log insights access
Reasoner-specific model parameters (incl. budget) in evaluators
(model support list continuously expanded, stay tuned)
More Planned Features coming as we sync our changelogs and the rest of the internal roadmap contents!
🐛 Bug Reports: GitHub Issues
📧 Enterprise Features: Contact [email protected]
💡 General: Discord
Last updated: 2025-06-30
Coming Soon!
To unlock full functionality, create a custom component that wraps the Root Signals skill and supports Root Signals Validators:
from typing import Dict
from typing import List
from haystack import component
from root import RootSignals
from root.validators import Validator
@component
class RootSignalsGenerator:
"""
Component to enable skill use
"""
def __init__(self, name: str, intent: str, prompt: str, model: str, validators: List[Validator]):
self.client = RootSignals()
self.skill = self.client.skills.create(
name=name,
intent=intent,
prompt=prompt,
model=model,
validators=validators,
)
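Note that the component above only defines the constructor. For the pipeline below it also needs a run method that executes the skill and returns its result under replies. A minimal sketch, assuming the skill object returned by skills.create can be executed with a run(...) call that takes the prompt template variables (check the SDK reference for the exact signature); it uses the ChatMessage and SkillExecutionResult imports introduced further down:
    @component.output_types(replies=SkillExecutionResult)
    def run(self, messages: List[ChatMessage]):
        # Assumption: the created skill is executed with its template variables;
        # here the latest chat message fills the {{question}} variable.
        result = self.skill.run({"question": messages[-1].content})
        return {"replies": result}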
For convenience, let's create another component to parse the validation results:
from root.generated.openapi_client.models.skill_execution_result import SkillExecutionResult
@component
class RootSignalsValidationResultParser:
    @component.output_types(passed=bool)
    def run(self, replies: SkillExecutionResult):
        # Expose only the overall validation verdict to downstream components
        return {"passed": replies.validation["is_valid"]}
We are now equipped to replace any OpenAI-compatible generator with a validated one, based on the RootSignalsGenerator component.
from haystack.dataclasses import ChatMessage
from haystack.core.pipeline.pipeline import Pipeline
from haystack.components.builders.dynamic_chat_prompt_builder import DynamicChatPromptBuilder
generator_A = RootSignalsGenerator(
name="My Q&A chatbot",
intent="Simple Q&A chatbot",
prompt="Provide a clear answer to the question: {{question}}",
model="gpt-4o",
validators=[Validator(evaluator_name="Clarity", threshold=0.6)]
)
pipeline = Pipeline(max_loops_allowed=1)
pipeline.add_component("prompt_builder", DynamicChatPromptBuilder())
pipeline.add_component("generator_A", generator_A)
pipeline.add_component("validation_parser", RootSignalsValidationResultParser())
pipeline.connect("prompt_builder.prompt", "generator_A.messages")
pipeline.connect("generator_A.replies", "validation_parser.replies")
prompt_template = """
Answer the question below.
Question: {{question}}
"""
result = pipeline.run(
{
"prompt_builder": {
"prompt_source": [ChatMessage.from_user(prompt_template)],
"template_variables": {
"question": "In the field of software development, what is the meaning and significance of 'containerization'? Use a popular technology as example. Cite sources where available."
},
}
},
include_outputs_from={
"generator_A",
"validation_parser",
},
)
{
'validation_parser': {'passed': True}, # use this directly, i.e. for haystack routers
'generator_A': { # full response from the generator, use llm_output for the plain response
'replies': SkillExecutionResult(
llm_output='Containerization in software development refers to the practice of encapsulating an application and its dependencies into a "container" that can run consistently across different computing environments. This approach ensures that the software behaves the same regardless of where it is deployed <truncated> \nSources:\n- Docker. "What is a Container?" Docker, https://www.docker.com/resources/what-container.\n- Red Hat. "What is containerization?" Red Hat, https://www.redhat.com/en/topics/containers/what-is-containerization.',
validation={'is_valid': True, 'validator_results': [{'evaluator_name': 'Clarity', 'evaluator_id': '603eae60-790b-4215-b6d3-301c16fc37c5', 'result': 0.85, 'threshold': 0.6, 'cost': 0.006645000000000001, 'is_valid': True, 'status': 'finished'}]},
model='gpt-4o',
execution_log_id='1fbdd6fc-f5a7-4e30-a7dc-15549b7557ec',
rendered_prompt="Provide a clear answer to the question: Answer the question below.\n \n Question: In the field of software development, what is the meaning and significance of 'containerization'? Use a popular technology as example. Cite sources where available.",
cost=0.003835)
}
}
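The comment above mentions Haystack routers; alternatively, you can simply branch on the verdict in plain Python after the pipeline run. A minimal sketch using the result dictionary shown above:
# Use the validator verdict to decide whether to surface the answer.
if result["validation_parser"]["passed"]:
    answer = result["generator_A"]["replies"].llm_output
else:
    answer = "Sorry, I could not produce a sufficiently clear answer."
print(answer)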
Evaluation has been critical in all Machine Learning/AI systems for decades, but it was not a problem until generative AI: back then, we had our test sets with ground truth annotations and simply calculated proportions or ratios (e.g. accuracy, precision, recall, AUROC, F1-score) or well-defined metric formulas (e.g. mean absolute error, or some custom metric) to estimate the performance of our systems. If we were satisfied with the accuracy and latency, we deployed our AI models to production.
We cannot do that anymore, because LLMs
output free text instead of pre-defined categories or numerical values
are non-deterministic
are instructable by semantic guidance; in other words, they have a prompt, and their behaviour depends on it in ways that are difficult to predict beforehand.
Therefore, applications powered by LLMs are inherently unpredictable, unreliable, weird, and in general hard to control.
This is the main blocker of large scale adoption and value creation with Generative AI. To overcome this, we need a new way of measuring, monitoring, and guardrailing AI systems.
Yes, there are numerous LLM benchmarks and leaderboards, yet
They measure LLMs, not LLM applications. Benchmarks are interested in low-level academic metrics that are far away from business goals.
Tasks and samples in those benchmarks do not reflect real-life use cases. For example, multiple choice high school geometry question answering performance is not relevant when one is developing a customer support chatbot that should not hallucinate.
Benchmarks are full of low-quality, incomplete, ambiguous, and erroneous samples.
Data leakage is rampant. Consciously or not, test samples or slight variations of them are often leaked into the training data.
Benchmarks are not always transparent about what kind of settings they use in the experiments (e.g. temperature, zero-shot or few-shot, prompts) and are hard to replicate.
In short,
You want to measure and monitor your specific LLM-powered automation, not the generic academic capabilities of an LLM.
Datasets in Root Signals contain static information that can be included as context for skill execution. They allow you to provide additional data to your skills, such as information about your organization, products, customers, or any other relevant domain knowledge.
By leveraging data sets, you can enhance the capabilities of your skills and provide them with relevant domain knowledge or test data to ensure their performance and accuracy.
See SDK documentation.
To import a new data set:
Navigate to the Data Sets view.
Click the "Import Data Set" button on the top right corner of the screen.
Enter a name for your data set. If no name is provided, the file name will be used as the data set name.
Choose the data set type:
Reference Data: Used for skills that require additional context.
Test Data: Used for defining test cases and validating skill or evaluator performance.
Select a tag for the data set or create a new one.
Either upload a file or provide a URL from which the system can retrieve the data.
Preview the data set by clicking the "Preview" button on the bottom right corner.
Save the data set by clicking the "Submit" button.
Data sets can be linked to skills using reference variables. When defining a skill, you can choose a data set as a reference variable, and the skill will have access to that data set during execution. This allows you to provide additional context or information to the skill based on the selected data set.
When creating a new skill or an evaluator, you can select a test data set or a calibration data set, respectively, to drive the skill or evaluator with multiple predefined sequential inputs for performance evaluation.
Root Signals allows you to test your skill against multiple models simultaneously. In the "Prompts" and "Models" sections of the skill creation form, you can add multiple prompt variants and select one or more models to be tested. By clicking the "Test" / "Calibrate" button in the bottom-right corner, the system runs your selected test data set against each of the chosen prompts and models. This enables you to compare their performance and select the combination with the best trade-offs for your use case.