Advanced use cases and common recipes
https://github.com/root-signals/rs-python-sdk
Objectives consist of a human-readable Intent and ground truth examples. An Objective serves two purposes:
Communication: Expressing the intended business purpose of the evaluator
Coordination: Serving as a battery of measures
Evaluation has been central to Machine Learning/AI systems for decades, but it was not a hard problem before generative AI. Back then, we had test sets with ground truth annotations and simply calculated proportions and ratios (e.g. accuracy, precision, recall, AUROC, F1 score) or well-defined metric formulas (e.g. mean absolute error, or a custom metric) to estimate the performance of our systems. If we were satisfied with the accuracy and latency, we deployed our AI models to production.
We cannot do that anymore because LLMs:
output free text instead of pre-defined categories or numerical values
are non-deterministic
are instructable by semantic guidance; in other words, they have a prompt. Their behaviour depends on that prompt in ways that are difficult to predict beforehand.
Therefore, applications powered by LLMs are inherently unpredictable, unreliable, weird, and in general hard to control.
This is the main blocker of large scale adoption and value creation with Generative AI. To overcome this, we need a new way of measuring, monitoring, and guardrailing AI systems.
Yes, there are numerous LLM benchmarks and leaderboards, yet
They measure LLMs, not LLM applications. Benchmarks are interested in low-level academic metrics that are far away from business goals.
Tasks and samples in those benchmarks do not reflect real-life use cases. For example, multiple choice high school geometry question answering performance is not relevant when one is developing a customer support chatbot that should not hallucinate.
Benchmarks are full of low-quality, incomplete, ambiguous, and erroneous samples.
Data leakage is rampant. Consciously or not, test samples or slight variations of them are often leaked into the training data.
Benchmarks are not always transparent about what kind of settings they use in the experiments (e.g. temperature, zero-shot or few-shot, prompts) and are hard to replicate.
In short,
You want to measure and monitor your specific LLM-powered automation, not the generic academic capabilities of an LLM.
Any developer can use Root Signals to:
Add appropriate metrics such as Truthfulness, Answer Relevance, or Coherence of responses to any LLM pipeline and optimize their design choices (which LLM to use, prompt, RAG hyper-parameters, etc.) using these measurements:
Log, record, and compare their changes and the measurements corresponding to each change
Integrate metrics into CI/CD pipelines (e.g. GitHub actions) to prevent regressions
Turn those metrics into guardrails that prevent wrong, inappropriate, undesirable, or otherwise sub-optimal behaviour of their LLM apps simply by adding trigger thresholds (see the sketch after this list). Monitor the performance in real time, in production.
Create custom metrics for attributes ranging from 'mention of politics' to 'adherence to our communications policy document v3.1'.
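As an illustration of the guardrail pattern above, here is a minimal sketch using the Python SDK. The client and method names are assumptions based on typical rs-python-sdk usage, and the evaluator name and threshold are illustrative; check the SDK reference for exact signatures.

```python
from root import RootSignals  # rs-python-sdk

client = RootSignals()  # assumes ROOTSIGNALS_API_KEY is set in the environment

user_query = "What is your refund policy?"
llm_response = "You can return any item within 30 days of purchase for a full refund."  # output of your LLM pipeline

# Score the response with a built-in evaluator (assumed lookup helper).
relevance = client.evaluators.get_by_name("Relevance")
result = relevance.run(request=user_query, response=llm_response)

# Turn the metric into a guardrail by adding a trigger threshold.
THRESHOLD = 0.7  # illustrative value
if result.score < THRESHOLD:
    llm_response = "I am not confident in this answer; routing it to a human agent."
```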
The dashboard provides a comprehensive overview of the performance of your specific LLM applications:
Root Signals provides 30+ built-in, ready-to-use evaluators called Root Evaluators.
Utilizing any LLM as a judge, you can create, benchmark, and tune Custom Evaluators.
We provide complete observability to your LLM applications through our Monitoring view.
Root Signals is available via
Root Signals can be used by individuals and organizations. Role-based Access Controls (RBAC), SLA, and security definitions are available for organizations. Enterprise customers also enjoy SSO signups via Okta and SAML.
Root Signals design philosophy starts from the principle of extreme semantic rigor. Briefly, this means making sure that, for example
The definitions and references of entities are tracked with maximal (and increasing) precision
Entities are assumed long-term and upgradeable
Entities are built for re-use
Changes will be auditable
Objective defines what you intend to achieve. It grounds an AI automation to the business target, such as providing a feature ('transform data source X into usable format Y') or a value ('suitability for use in Z').
Evaluator is a function that assigns a numeric value to a piece of content such as text, along a semantically defined dimension (truthfulness, relevance of an answer, coherence, etc.).
Models are the actual source of the intelligence. A model generally refers to the type of model (such as GPT), the provider of the model (such as Azure), and the specific variant (such as GPT-4o). The models available on the Root Signals platform consist of:
Proprietary and hosted open-source models accessible via API. These models can be accessed via your API key or the Root Signals platform key (with corresponding billing responsibilities).
Open-source models provided by Root Signals.
Some model providers are GDPR-compliant, ensuring data processing meets the General Data Protection Regulation (GDPR) requirements. However, please note that GDPR compliance by the provider does not necessarily mean that data is processed within the EU.
Organization admin can control the API keys and restrict access to a specific subset of models.
Root Signals is a measurement, observability, and control platform for GenAI applications, automations, and agentic workflows powered by Large Language Models (LLMs). Such applications include chatbots, Retrieval Augmented Generation (RAG) systems, agents, data extractors, summarizers, translators, AI assistants, and various automations powered by LLMs.
Create an account and get started in 30 seconds.
Model is the AI model such as an LLM that provides the semantic processing of the inputs. Notably, the list contains both API-based models such as OpenAI and Anthropic models, and open source models such as Llama and Mistral models. Finally, you can add your own locally running models to the list. The organization Admin controls the availability of models enabled in your organization.
Models added by your organization. See the for model details.
To get started, sign up and log in to the Root Signals app. Select an evaluator under the Evaluators tab and click Execute. You will get a score between 0 and 1 and a justification for the score.
Create your Root Signals API key.
For the best experience, we encourage you to create an account. However, if you prefer to run quick tests at this point, please create a temporary API key.
Root Signals provides over 30 built-in evaluators, or judges, which you can use to score any text based on a wealth of metrics. You can attach evaluators to an existing application with just a few lines of code.
You can execute evaluators in your favourite framework and tech stack via our Python SDK and REST API:
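For example, a minimal sketch with the Python SDK; the client and method names are assumptions based on typical rs-python-sdk usage, so check the SDK reference for exact signatures.

```python
from root import RootSignals  # rs-python-sdk

# Assumes a ROOTSIGNALS_API_KEY environment variable (or pass api_key=... explicitly).
client = RootSignals()

# Look up a built-in evaluator and score a piece of text (assumed helper names).
evaluator = client.evaluators.get_by_name("Clarity")
result = evaluator.run(response="Our Q3 report is ready and covers all key metrics.")

print(result.score)          # normalized score between 0 and 1
print(result.justification)  # textual rationale for the score
```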
Access to all Root Signals entities is controlled via user roles as well as entity-specific settings by the entity creator. The fundamental roles are the User and the Administrator.
Administrators have additional entity management privileges when it comes to management of datasets, objectives, skills and evaluators:
Administrators can see all the entities in the organization, including unlisted ones. This allows them to have an overview of all the data and functional assets in the organization.
Administrators can change the status of any entity such as a dataset from unlisted to listed and vice versa. This enables them to control which entities are shared with the wider organization.
Administrators can delete any entity, regardless of who created it. This is useful for managing obsolete or irrelevant entities.
The Administrator also controls the accessibility of models across the organization, as well as users and billing.
The requests to any models wrapped within skill objects, and their responses, are traceable within the log objects of the Root Signals platform.
The retention of logs is determined by your platform license. You may export logs at any point for your local storage. Access to execution logs is restricted based on your user role and skill-specific access permissions.
Objectives, evaluators, skills and test datasets are strictly versioned. The version history allows keeping track of all local changes that could affect the execution.
To understand reproducibility of pipelines of generative models, these general principles hold:
For any models, we can control for the exact inputs to the model, record the responses received, and the evaluator results of each run.
For open source models, we can pinpoint the exact version of the model (weights) being used, if this is guaranteed by the model provider, or if the provider is Root Signals.
For proprietary models whose weights are not available, we can pinpoint the version based on the version information given by the providers (such as gpt-4-turbo-2024-04-09) but we cannot guarantee those models are, in reality, fully immutable
Any LLM request with a 'temperature' parameter above 0 is guaranteed not to be deterministic. Temperature = 0 and/or a fixed value of a 'seed' parameter usually means the result is deterministic, but your mileage may vary (see the sketch after this list).
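For illustration, here is how sampling parameters can be pinned with the OpenAI Python client; the model name is only an example, and full immutability of proprietary models is still not guaranteed.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY in the environment

# Temperature 0 plus a fixed seed makes repeated runs as deterministic as the provider allows.
completion = client.chat.completions.create(
    model="gpt-4o",  # example model identifier
    messages=[{"role": "user", "content": "Summarize our refund policy in one sentence."}],
    temperature=0,
    seed=42,
)
print(completion.choices[0].message.content)
```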
Datasets in Root Signals contain static information that can be included as context for skill execution. They can contain information about your organization, products, customers, etc. Datasets are linked to skills using reference variables.
Access to datasets is controlled through permissions. By default, when a user uploads a new dataset, it is set to 'unlisted' status. Unlisted datasets are only visible to the user who created them and to administrators in the organization. This allows users to work on datasets privately until they are ready to be shared with others.
To make a dataset available to other users in the organization, the dataset owner or an administrator needs to change the status to 'listed'. Listed datasets are visible to all users in the organization and can be used in skills by anyone.
Note that dataset permissions control whether a dataset can be used in skill creation or skill editing as a reference variable or as a test dataset. Unless more specific permissions information is made available via enterprise integrations, dataset permissions do not control who can use the dataset in skill execution. I.e., once a dataset is fixed to a skill as a reference variable, anyone who has privileges to execute the skill will also have implicit access to the dataset through the skill execution.
It is important for dataset owners and administrators to carefully consider the sensitivity and relevance of datasets before making them widely available. Datasets may contain confidential or proprietary information that should only be accessible to authorized users.
Contact Root Signals for more fine-grained controls in enterprise, regulated or governmental contexts.
The dataset permission system in Root Signals allows for granular control over who can access and use specific datasets. The unlisted/listed status toggle and the special privileges granted to administrators provide flexibility in managing data assets across the organization. Proper management of dataset permissions is crucial for ensuring data security and relevance in skill development and execution.
Datasets in Root Signals contain static information that can be included as context for skill execution. They allow you to provide additional data to your skills, such as information about your organization, products, customers, or any other relevant domain knowledge.
By leveraging data sets, you can enhance the capabilities of your skills and provide them with relevant domain knowledge or test data to ensure their performance and accuracy.
To import a new data set:
Navigate to the Data Sets view.
Click the "Import Data Set" button on the top right corner of the screen.
Enter a name for your data set. If no name is provided, the file name will be used as the data set name.
Choose the data set type:
Reference Data: Used for skills that require additional context.
Test Data: Used for defining test cases and validating skill or evaluator performance.
Select a tag for the data set or create a new one.
Either upload a file or provide a URL from which the system can retrieve the data.
Preview the data set by clicking the "Preview" button on the bottom right corner.
Save the data set by clicking the "Submit" button.
Data sets can be linked to skills using reference variables. When defining a skill, you can choose a data set as a reference variable, and the skill will have access to that data set during execution. This allows you to provide additional context or information to the skill based on the selected data set.
When creating a new skill or an evaluator, you can select a test dataset or a calibration dataset, respectively, to drive the skill or evaluator with multiple predefined sequential inputs for performance evaluation.
Root Signals allows you to test your skill against multiple models simultaneously. In the "Prompts" and "Models" sections of the skill creation form, you can add multiple prompt variants and select one or more models to be tested. By clicking the "Test" / "Calibrate" button in the bottom right corner, the system will run tests using your selected test dataset against each of the chosen prompts and models. This feature enables you to compare their performance and select the combination with the best trade-offs for your use case.
Root Signals provides a rich collection of pre-built evaluators that you can use, such as:
Quality of professional writing: checks how grammatically correct, clear, concise and precise the output is
Toxicity Detection: Identifies any toxic or inappropriate content in the skill's output.
Faithfulness: Verifies the accuracy of information provided by the skill.
Sentiment Analysis: Determines the overall sentiment (positive, negative, or neutral) of the skill's output.
You can also define your own custom evaluators.
Evaluators are a special type of skill in Root Signals that assess the performance and quality of the outputs of operational skills. Custom evaluators are similar in structure to normal skills, consisting of a name, objective, and function.
The objective of an evaluator skill consists of two components:
Intent: This describes the purpose and goal of the evaluator skill, specifying what it aims to evaluate or assess in the outputs of other skills.
Calibrator: The calibrator serves a similar role to test data in operational skills. It provides the ground truth set of appropriate numeric values for specific request-response pairs that defines the intended behavior of the evaluator. This set 'calibrates' its evaluation criteria and ensures consistent and accurate assessments.
The function of an evaluator skill consists of three components:
Prompt
Demonstrations
Model
The prompt (or instruction) defines the instructions and variable content the evaluator prompts a large language model with. It should clearly specify the criteria and guidelines for assessing the quality and performance of skill outputs.
Note: During skill execution, the prompt defined by the user is appended to a more general template containing instructions responsible for guiding and optimizing the behavior of the evaluator. Thus the user does not have to bother with generic instructions such as "Give a score between 0 and 1". It is sufficient to describe the evaluation criteria of the specific evaluator at hand.
Example: How well does the {{response}} adhere to the instructions given in {{request}}?
All variable types are available for an evaluator skill. However, some restrictions apply.
The prompt of an evaluator must contain a special variable named `response` that represents the LLM output to be evaluated. It can also contain a special variable named `request` if the prompt that produced the response is considered relevant for evaluation. `request` and `response` can be either input or reference variables. In the latter case the variable is associated with a dataset that can be searched for contextual information to support the evaluation, using Retrieval Augmented Generation.
A demonstration is a sample consisting of a request-response pair (or just a response, if the request is not considered necessary for evaluation), an expected score, and an optional justification. Demonstrations exemplify the expected behavior of the evaluator skill. Demonstrations are provided to the model, and therefore must be strictly separated from any evaluation or calibration data of the related AI skill.
A justification illustrates the rationale for the given score. Justification can be helpful when the reason for a specific score is not obvious, allowing the model to pay attention to relevant aspects of the evaluated response and tackle ambiguous cases in a nuanced way.
Example:
A sample demonstration for an evaluator for determining if content is safe for children.
The model refers to the specific language model or engine used to execute the evaluator skill. It should be chosen based on its capabilities and suitability for the evaluation task.
Unlike operational skills, evaluator skills do not have validators associated with them.
Calibration is the response to the naturally arising question: How can we trust evaluation results? The calibrator provides a way to quantify the performance of the evaluator by providing the ground truth against which the evaluator can be gauged. The reference dataset that forms the calibrator defines the expected behavior of the evaluator.
The samples of the calibration dataset are similar to those of the demonstration dataset, consisting of a score, a response, and an optional request and justification.
On the Calibrator page:
The calibration dataset can be imported from a file or typed in the editor.
A synthetic dataset can be generated, edited, and appended.
The Calibration page enables comparing the actual performance of the evaluator on the samples of the calibrator dataset with the expected performance defined by the scores of the set.
On this page:
Total deviance, which quantifies the average magnitude of the errors between the values predicted by an evaluator and the actual observed values, can be calculated. A lower total deviance indicates that the evaluator's predictions are closer to the actual outcomes, which signifies better performance of the evaluator. The total deviance is computed using the Root Mean Square method (see the sketch after this list).
Deviations for individual samples of the dataset are displayed, enabling easy identification of weak points of the evaluator. If a particular sample has a high deviation, there are characteristics in the sample that confuse the evaluator.
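As a concrete reading of the Root Mean Square method mentioned above, total deviance can be computed as follows; this is an illustrative calculation with made-up numbers, not the platform's internal code.

```python
import math

# Expected scores from the calibration dataset vs. scores the evaluator actually produced.
expected = [1.0, 0.2, 0.8, 0.0]   # ground truth scores (illustrative)
predicted = [0.9, 0.4, 0.7, 0.1]  # evaluator outputs (illustrative)

# Root Mean Square deviance: lower values mean the evaluator tracks the ground truth better.
total_deviance = math.sqrt(
    sum((e - p) ** 2 for e, p in zip(expected, predicted)) / len(expected)
)
print(round(total_deviance, 3))  # ≈ 0.132
```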
To improve the performance or 'calibrate' an evaluator, adjustments can be made to one or more of the three key components: the prompt, the demonstrations, and the model.
Effective strategies for this can be deduced by examining the calibration results. Inspecting the worst-performing samples, those with the largest deviations, can help identify the evaluator's weak points.
Then, one or more steps can be taken:
The instructions given in the prompt can be made more specific to adjust the behavior in the problem cases.
Modify demonstration content by adding examples similar to the problematic samples, which can enhance performance in these areas. Additional instructions can be added by including a justification to a demonstration. Note: Maintaining a sufficiently large calibration dataset reduces the risk of overfitting, i.e., producing an evaluator tailored to the calibration but lacking generalization.
The model can be changed. Overall performance can be improved by using a larger or otherwise better suited model, often at the cost of evaluation latency and price.
After each modification, it's advisable to recalculate the deviations to assess the direction and magnitude of the impact on performance.
Evaluators tagged with RAG Evaluator work properly when evaluating skills with reference variables. Alternatively, when not used to evaluate skill outputs, a `contexts` parameter containing a set of documents as a list of strings—corresponding to the retrieved context data—must be passed.
Evaluators tagged with Ground Truth Evaluator can be used for evaluating test sets that contain an `expected_output` column. When used through the SDK, an `expected_output` parameter must likewise be passed.
Evaluators tagged with Function Call Evaluator can be used through the SDK and require a `functions` parameter, conforming to the OpenAI-compatible tools parameter, to be passed.
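A hedged sketch of passing these parameters through the Python SDK; the client and lookup helpers are assumptions based on typical rs-python-sdk usage, and only the `contexts`, `expected_output`, and `functions` parameter names come from the text above.

```python
from root import RootSignals

client = RootSignals()  # assumes ROOTSIGNALS_API_KEY in the environment

# RAG evaluator used outside a skill: pass the retrieved documents explicitly.
faithfulness = client.evaluators.get_by_name("Faithfulness")  # assumed lookup helper
faithfulness.run(
    request="What is our refund window?",
    response="Refunds are accepted within 30 days of purchase.",
    contexts=["Refund policy: items can be returned within 30 days of purchase."],
)

# Ground truth evaluator: supply the expected answer as well.
correctness = client.evaluators.get_by_name("Answer Correctness")
correctness.run(
    request="What is our refund window?",
    response="Refunds are accepted within 30 days.",
    expected_output="30 days from the date of purchase.",
)

# Function call evaluator: pass an OpenAI-compatible tools definition.
json_accuracy = client.evaluators.get_by_name("JSON Property Name Accuracy")
json_accuracy.run(
    response='{"refund_window_days": 30}',
    functions=[{
        "type": "function",
        "function": {
            "name": "record_refund_window",
            "parameters": {
                "type": "object",
                "properties": {"refund_window_days": {"type": "integer"}},
            },
        },
    }],
)
```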
Relevance Assesses the relevance of the response in relation to the request by evaluating accuracy, completeness, adherence to the prompt, and logical consistency, to determine whether responses remain directly on-topic and informative.
Safety for Children Checks the appropriateness of content for young audiences, focusing on avoiding language or themes that could be harmful or unsuitable for children, thus promoting safety and age-appropriateness.
Sentiment Recognition Identifies the emotional tone of the response, determining whether it is predominantly positive, negative, or neutral, which helps in gauging the overall sentiment conveyed by the content.
Coherence Assesses whether the response is logically consistent and contextually relevant, checking for smooth transitions between ideas and the maintenance of topic focus throughout the response.
Conciseness Measures the brevity and directness of the response, aiming to convey information efficiently without the use of unnecessary words, thus avoiding redundancy.
Engagingness Evaluates the ability of the response to engage readers, determining whether the content is interesting and likely to hold attention, or if it tends toward being boring or predictable.
Originality Checks the originality and creativity of the response, focusing on providing unique and inventive perspectives rather than relying on clichéd or derivative ideas.
Clarity Measures how easily the response can be understood on a first reading, aiming for ideas to be presented in a logical, straightforward manner with sufficient detail for clarity.
Precision Assesses the accuracy and specificity of the response, aiming for responses that are detailed and exact, with no room for ambiguities or overly general statements.
Persuasiveness Evaluates the persuasiveness of the response by examining the strength, clarity, and convincingness of the main argument or point, including how well it is supported by evidence, examples, and logical reasoning.
Confidentiality Assesses the response for the presence of any personal details, private communications, sensitive business data, or any other information that should be kept confidential, focusing on maintaining privacy and security.
Harmlessness Assesses the harmlessness of the response by checking for the presence of hate speech, slurs, derogatory language, and any incitement or glorification of violence. It evaluates the overall potential of the content to cause harm or distress.
Formality Evaluates the formality of the response by considering factors such as word choice, sentence structure, tone, grammar, and overall style. This helps in matching the content to the expected level of formality for the context.
Politeness Assesses the politeness of the response by examining factors such as word choice, tone, phrasing, and the overall level of respect and courtesy demonstrated in the response.
Helpfulness Evaluates the helpfulness of the response by considering how useful, informative, and beneficial the text is to a reader seeking information. Helpful text provides clear, accurate, relevant, and comprehensive information to aid the reader's understanding and ability to take appropriate action.
Non-toxicity Assesses the non-toxicity of the response. Text that is benign and completely harmless receives high scores.
Faithfulness RAG Evaluator This corresponds to hallucination detection in RAG settings. Measures the factual consistency of the generated answer with respect to the context. It determines whether the response accurately reflects the information provided in the context. This is the high-accuracy variant of our set of Faithfulness evaluators.
Faithfulness-swift RAG Evaluator
This is the faster variant of our set of Faithfulness evaluators.
Answer Relevance Measures how relevant a response is with respect to the prompt/query. Completeness and conciseness of the response are considered.
Truthfulness RAG Evaluator Assesses factual accuracy by prioritizing context-backed claims over model knowledge, while preserving partial validity for logically consistent but unverifiable claims. Unlike Faithfulness, allows for valid model-sourced information beyond the context. This is the high-accuracy variant of our set of Truthfulness evaluators.
Truthfulness-swift RAG Evaluator This is the faster variant of our set of Truthfulness evaluators.
Quality of Writing - Professional Measures the quality of writing as a piece of academic or other professional text. It evaluates the formality, correctness, and appropriateness of the writing style, aiming to match professional standards.
Quality of Writing - Creative Measures the quality of writing as a piece of creative text. It evaluates the creativity, expressiveness, and originality of the content, focusing on its impact and artistic expression.
JSON Content Accuracy RAG Evaluator | Function Call Evaluator Checks if the content of the JSON response is accurate and matches the documents and instructions, verifying that the JSON data correctly represents the intended information.
JSON Property Completeness Function Call Evaluator Checks how many of the required properties are present in the JSON response, verifying that all necessary fields are included. This is a string (non-LLM) evaluator.
JSON Property Type Accuracy Function Call Evaluator Checks if the types of properties in the JSON response match the expected types, verifying that the data types are correct and consistent. This is a string (non-LLM) evaluator.
JSON Property Name Accuracy Function Call Evaluator Checks if the names of properties in the JSON response match the expected names, verifying that the field names are correct and standardized. This is a string (non-LLM) evaluator.
JSON Empty Values Ratio Function Call Evaluator Checks the portion of empty values in the JSON response, aiming to minimize missing information and ensure data completeness. This is a string (non-LLM) evaluator.
Answer Semantic Similarity Ground Truth Evaluator Measures the semantic similarity between the generated answer and the ground truth, helping to evaluate how well the response mirrors the expected response.
Answer Correctness Ground Truth Evaluator Measures the factual correspondence of the generated response against a user-supplied ground truth. It considers both semantic similarity and factual consistency.
Context Recall RAG Evaluator | Ground Truth Evaluator Measures whether the retrieved context provides sufficient information to produce the ground truth response, evaluating if the context is relevant and comprehensive according to the expected output.
Context Precision RAG Evaluator | Ground Truth Evaluator Measures the relevance of the retrieved contexts to the expected output.
As our evaluators are LLM judges, they are non-deterministic, i.e., the same input can result in slightly different scores. We try to keep this fluctuation low. The expected standard deviations of each evaluator are reported below along three dimensions: short/long context, single-turn/multi-turn, and low/high ground truth score:
See documentation.
An evaluator is a metric for a piece of text that maps a string originating from a language model to a numeric value between 0 and 1. For example, an evaluator could measure the "Truthfulness" of the generated text. When coupled with a threshold value, evaluators can instead serve as validators for non-evaluator skills.
As evaluators are a special type of skill, the concepts that apply to skills apply to evaluator skills too.
A Root Signals subscription provides a selection of models you can use in any of your skills. You are not limited to that selection, though. Integrating with cloud providers' models or connecting to locally hosted models is possible via the SDK or the REST API.
To use an external or locally hosted model, add the model endpoint via the SDK or through the REST API.
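A hedged sketch of registering such an endpoint with the Python SDK; the `models.create` helper and its parameter names are assumptions, not a documented signature, so consult the SDK/REST reference before use.

```python
from root import RootSignals

client = RootSignals()

# Hypothetical helper for registering a self-hosted, OpenAI-compatible endpoint.
client.models.create(
    name="my-local-llama",           # how the model would appear in the model list
    url="http://localhost:8000/v1",  # endpoint of the locally hosted model
)
```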
Building production-ready and reliable AI applications requires safeguards provided by an evaluation layer. LLM responses can vary drastically based on even the slightest input changes.
Root Signals provides a robust set of fundamental evaluators suitable for any LLM-based application.
You need a few examples of LLM outputs (text). Those can be from any source, such as a summarization output on a given topic.
Let's start with the Precision evaluator. Based on the text you want to evaluate, feel free to try other evaluators as well.
Click on the Precision evaluator and then click on the Execute skill button.
Paste the text you want to evaluate into the output field and click Execute. You will get a numeric score for the metric the evaluator measures, based on the text you provided.
An individual score is not very interesting. The power of evaluation lies in integrating evaluators into an LLM application.
Integrating the evaluators as part of your LLM application is a more systematic approach to evaluating LLM outputs. That way, you can compare the scores over time and take action based on the evaluation results.
The Precision evaluator details page contains information on how to add it to your application. First, you must fetch a Root Signals API key and then execute the example cURL command.
Go to the Precision evaluator details page
Click on the Add to your application link
Copy the cURL command
You can omit the `request` field from the data payload and add the text to evaluate in the `response` field.
Example (cURL)
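The cURL command itself is copied from the evaluator page in the app. As an equivalent illustration with the Python SDK (assumed client and method names, not verbatim from this guide):

```python
from root import RootSignals

client = RootSignals()

# Evaluate standalone text: only the response field is needed here.
precision = client.evaluators.get_by_name("Precision")  # assumed lookup helper
result = precision.run(
    response="Our new battery lasts up to 12 hours under typical mixed use.",
)
print(result.score)
```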
Root Signals provides evaluators that fit most needs, but you can add custom evaluators for specific needs. In this guide, we will add a custom evaluator and tune its performance using demonstrations.
Consider a use case where you need to evaluate a text based on its number of weasel words or ambiguous phrases. Root Signals provides the optimized Precision evaluator for this, but let's build something similar to go through the evaluator-building process.
Navigate to the Evaluator Page:
Go to the evaluator page and click on "New Evaluator."
Name Your Evaluator:
Type the name for the evaluator, for example, "Direct language."
Define the Intent:
Give the evaluator an intent, such as "Ensures the text does not contain weasel words."
Create the Prompt:
"Is the following text clear and has no weasel words"
Add a placeholder (variable) for the text to evaluate:
Click on the "Add Variable" button to add a placeholder for the text to evaluate.
E.g., "Is the following text clear and has no weasel words: {{response}}"
Select the Model:
Choose the model, such as gpt-4-turbo, for this evaluation.
Save and Test the Evaluator:
You can add demonstrations to the evaluator to tune its scores to match more closely to the desired behavior.
Let's penalize using the word "probably"
Go to the Direct language evaluator and click Edit
Edit the Prompts sections
Add a demonstration
For the output: "This solution will probably work for most users."
Score: 0.1
Save the evaluator and try it out
Note that adding more demonstrations, such as the following, will tune the scores further:
"The project will probably be completed on time."
"We probably won't need to make any major changes."
"He probably knows the answer to your question."
"There will probably be a meeting tomorrow."
"It will probably rain later today."
Both our Skills and Evaluators may be used as custom generator LLMs in third-party frameworks, and we are committed to supporting an OpenAI ChatResponse-compatible API.
Note, however, that additional functionality, such as validation results and calibration, is not available as part of OpenAI responses and requires the user to implement additional code if anything besides failing on unsuccessful validation is required.
The Evaluators page shows all evaluators at your disposal. Root Signals provides the base evaluators, but you can also build custom evaluators for specific needs.
Click Create evaluator to define your own.
Demonstrations will further adjust the evaluator's behavior. Refer to the full evaluator documentation for more information.
Advanced use cases can rely on referencing the `completion.id` returned by our API as a unique identifier for downstream tasks. Please refer to the corresponding section for details.
Coming Soon!
In Root Signals, evaluation is treated as a procedure to compute a metric grounded on a human-defined criteria, emphasizing the separation of utility grounding (Objective) and implementation (Evaluator function).
This lets the criteria and implementations for the evaluations evolve in two separate controlled and trackable tracks, each with different version control logic.
Metric evaluators are different from other entities in the world, and simply treating them as "grounded in data", on one hand, or as "tests", on the other, misses some of their core properties.
In Root Signals, an Objective consists of
Intent that is human-defined and human-understandable, corresponding to the precise attribute being measured.
Calibration dataset that defines, via examples, the structure and scale of those criteria.
An Evaluator function consists of:
Predicate that uniquely specifies the task to the LLMs that power the evaluator
LLM
In-context examples (demonstrations)
[Optionally] Associated data files
An Evaluator function is typically associated with an Objective that connects it to business / contextual value, but the two have no causal connection.
Root Signals platform itself handles:
Semantic quantization: Guaranteeing the predicates are consistently mapped to metrics (for supported LLMs). This lets us abstract the predicates out of the boilerplate prompts needed to yield robust metrics
Version control of evaluator implementations
Maintenance of relationships* between Objectives and Evaluators
Monitoring
E.g., if an Objective is changed (e.g. its calibration dataset is altered), it is not a priori clear whether the related criteria have changed, which then affects all evaluator variants using the Objective, rendering measurements backwards-incompatible. Hence, the best practice enforced by the Root Signals platform is to create an entirely new Objective, so that it is clear the criteria have changed. This can be bypassed, however, when the Objective is still in its formation stage and/or you accept that the criteria will change over time.
Over time, improved evaluator functions will be created (including but not limited to model updates) to improve upon the Objective targets. On the other hand, Objectives tend to branch and become more precise over time, passing the burden of resolving the question of "is this still the same Objective" to the users, while providing the software support to make those calls either way in an auditable and controllable manner.
Start by attaching an empty calibration set to the evaluator:
Navigate to the Direct Language evaluator page and click Edit.
Select the Calibration section and click Add Dataset.
Name the dataset (e.g., “Direct Language Calibration Set”).
Optionally, add sample rows, such as:
Click Save and close the dataset editor.
Optionally, click the Calibrate button to run the calibration set.
Save the evaluator
You can enhance your calibration set using real-world data from evaluator runs stored in the execution log.
Locate a relevant evaluator run and click on it.
Click Add to Calibration Dataset to include its output and score in the calibration set.
By regularly updating and running the calibration set, you safeguard the evaluator against unexpected behavior, ensuring its continued accuracy and reliability.
To unlock full functionality, create a custom component that wraps the RS skill and supports Root Signals Validators.
For convenience, let's create another component to parse validation results.
We are now equipped to replace any OpenAI-compatible generator with a validated one, based on the `RootSignalsGenerator` component.
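A hedged sketch of what such a wrapper component can look like in a Haystack 2.x pipeline; the Root Signals skill lookup and run calls, and the result attribute names, are assumptions based on typical rs-python-sdk usage rather than verbatim from this guide.

```python
from typing import List

from haystack import component
from root import RootSignals  # rs-python-sdk


@component
class RootSignalsGenerator:
    """Generator-style component backed by a Root Signals skill (sketch)."""

    def __init__(self, skill_name: str):
        self._client = RootSignals()
        self._skill_name = skill_name

    @component.output_types(replies=List[str], validation_results=List[dict])
    def run(self, prompt: str):
        skill = self._client.skills.get_by_name(self._skill_name)  # assumed helper
        result = skill.run(variables={"input": prompt})             # assumed signature

        return {
            "replies": [result.llm_output],                         # assumed attribute
            "validation_results": [vars(v) for v in result.validation_results],
        }
```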
To ensure the reliability of the evaluator, you can create and use test data, referred to as a calibration dataset. A calibration set is a collection of LLM outputs, prompts, and expected scores that serve as benchmarks for evaluator performance.
Go to the page.
Coming Soon!
We adhere to Semantic Versioning (SemVer) principles to manage the versions of our software products effectively. This ensures clarity and predictability in how updates and changes are handled.
Communication of Breaking Changes
Notification: All breaking changes are communicated to stakeholders via email. These notifications provide details about the nature of the change, the reasons behind it, and guidance on how to adapt to these changes.
Versioning: When a breaking change is introduced, the major version number of the software is incremented. For example, an upgrade from version 1.4.5 to 2.0.0 indicates the introduction of changes that may disrupt existing workflows or dependencies.
Documentation: Each major release accompanied by breaking changes includes updated documentation that highlights these changes and provides comprehensive migration instructions to assist in transitioning smoothly
Agentic RAG with Root Signals Relevance Judge
The following is from LangGraph docs:
Define the Decision-maker as a Root Judge
Now we define the Root Signals Relevance evaluator as the decision maker for whether the answer should come from the retrieved docs or not. The advantages of using Root Signals (as opposed to the original LangGraph method) are (see the sketch after this list):
We can control the relevance threshold because Root Signals evaluators always return a normalized score between 0 and 1.
If we want, we can incorporate the Justification in the decision-making process.
The code is much shorter, i.e. about ⅓ of that of LangGraph tutorial.
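A hedged sketch of such a decision node as a LangGraph-style conditional edge; the Root Signals client and method names are assumptions, the threshold is illustrative, and the message-based state handling is simplified from the LangGraph tutorial.

```python
from root import RootSignals

client = RootSignals()
relevance = client.evaluators.get_by_name("Relevance")  # assumed lookup helper

RELEVANCE_THRESHOLD = 0.7  # tune to your tolerance


def grade_documents(state) -> str:
    """Conditional edge: answer from the retrieved docs, or rewrite the query."""
    question = state["messages"][0].content
    retrieved = state["messages"][-1].content  # output of the retriever tool

    result = relevance.run(request=question, response=retrieved)
    # result.justification could also be logged or fed back to the agent here.
    return "generate" if result.score >= RELEVANCE_THRESHOLD else "rewrite"
```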
The rest of the tutorial is still from LangGraph:
Our RAG Agent is ready:
A replication of the Agentic RAG tutorial from the LangGraph docs, where the decision of whether or not to use the retrieved content to answer a question is powered by Root Signals Evaluators.
Absolutely. We have a REST API that you can call from your favourite tech stack.
Yes. Various evaluators from us support reference-based evaluations where you can bring your own ground truth expected responses. See our Ground Truth evaluators.
You can use arbitrary tags for evaluation executions. See the documentation.
Root Signals provides evaluators for RAG use cases, where you can give the context as part of the evaluated content.
One such evaluator is the Truthfulness evaluator, which measures the factual consistency of the generated answer against the given context and general knowledge.
Here is an example of running the Truthfulness evaluator using the Python SDK. Pass the context used to get the LLM response in the contexts parameter.
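A hedged sketch of that call; the client and method names are assumptions based on typical rs-python-sdk usage, and only the `contexts` parameter name comes from the text above.

```python
from root import RootSignals

client = RootSignals()  # assumes ROOTSIGNALS_API_KEY in the environment

truthfulness = client.evaluators.get_by_name("Truthfulness")  # assumed lookup helper
result = truthfulness.run(
    request="What is the warranty period for the X200 headphones?",
    response="The X200 headphones come with a two-year warranty.",
    contexts=[
        "Product sheet: The X200 headphones include a 24-month limited warranty.",
    ],
)
print(result.score, result.justification)
```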