Evaluators
An evaluator is a metric for a piece of text: it maps a string originating from a language model to a numeric value between 0 and 1. For example, an evaluator could measure the "Truthfulness" of the generated text. When coupled with a threshold value, an evaluator can also serve as a validator for non-evaluator skills.
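As a minimal illustration of the thresholding idea (plain Python, not part of the Root Signals product):

```python
from typing import Callable

# An evaluator maps a piece of text to a score in [0, 1].
Evaluator = Callable[[str], float]

def as_validator(evaluator: Evaluator, threshold: float) -> Callable[[str], bool]:
    """Turn an evaluator into a pass/fail validator by applying a threshold."""
    def validate(text: str) -> bool:
        return evaluator(text) >= threshold
    return validate

# Example: a (hypothetical) truthfulness evaluator used as a validator at 0.8.
# validate_truthfulness = as_validator(truthfulness_evaluator, threshold=0.8)
# validate_truthfulness("Paris is the capital of France.")  # True if score >= 0.8
```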
Root Signals provides a rich collection of pre-built evaluators that you can use, such as:
Quality of Professional Writing: Checks how grammatically correct, clear, concise, and precise the output is.
Toxicity Detection: Identifies any toxic or inappropriate content in the skill's output.
Faithfulness: Verifies the accuracy of information provided by the skill.
Sentiment Analysis: Determines the overall sentiment (positive, negative, or neutral) of the skill's output.
You can also define your own custom evaluators.
Evaluators are a special type of skill in Root Signals that assess the performance and quality of the outputs of operational skills. Custom evaluators are similar in structure to normal skills, consisting of a name, objective, and function.
Objective
The objective of an evaluator skill consists of two components:
Intent: This describes the purpose and goal of the evaluator skill, specifying what it aims to evaluate or assess in the outputs of other skills.
Calibrator: The calibrator serves a similar role to test data in operational skills. It provides the ground truth set of appropriate numeric values for specific input-output pairs that defines the intended behavior of the evaluator. This set 'calibrates' its evaluation criteria and ensures consistent and accurate assessments.
Function
The function of an evaluator skill consists of three components:
Prompt
Demonstrations
Model
Prompt
The prompt (or instruction) defines the instructions and variable content that the evaluator sends to a large language model. It should clearly specify the criteria and guidelines for assessing the quality and performance of skill outputs.
Note: During skill execution, the prompt defined by the user is appended to a more general template containing instructions responsible for guiding and optimizing the behavior of the evaluator. Thus the user does not have to bother with generic instructions such as "Give a score between 0 and 1"; it is sufficient to describe the evaluation criteria of the specific evaluator at hand.
Example: How well does the {{output}} adhere to instructions given in {{input}}.
Variables in an evaluator skill
All variable types are available for an evaluator skill. However, some restrictions apply.
The prompt of an evaluator must contain a special variable named output that represents the LLM output to be evaluated. It can also contain a special variable named input if the prompt that produced the output is considered relevant for evaluation. Both output and input can be either input or reference variables. In the latter case, the variable is associated with a dataset that can be searched for contextual information to support the evaluation, using Retrieval-Augmented Generation. In addition, a dataloader variable can be inserted for dynamic information retrieval.
Demonstrations
A demonstration is a sample consisting of an input-output pair (or just an output, if the input is not considered necessary for evaluation), an expected score, and an optional justification. Demonstrations exemplify the expected behavior of the evaluator skill. Demonstrations are provided to the model and must therefore be strictly separated from any evaluation or calibration data of the related AI skill.
A justification illustrates the rationale for the given score. Justification can be helpful when the reason for a specific score is not obvious, allowing the model to pay attention to relevant aspects of the evaluated response and tackle ambiguous cases in a nuanced way.
Example: a sample demonstration for an evaluator that determines whether content is safe for children.
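For illustration only (the texts and score below are hypothetical, not taken from the product), such a demonstration might look like this:

Input: Write a short introduction to photosynthesis for a third-grade science class.
Output: Plants are like tiny chefs! They use sunlight, water, and air to make their own food.
Score: 1.0
Justification: The content is friendly and educational, and contains no language or themes that could be harmful to children.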
Model
The model refers to the specific language model or engine used to execute the evaluator skill. It should be chosen based on its capabilities and suitability for the evaluation task.
Calibration
Unlike operational skills, evaluator skills do not have validators associated with them.
Calibration addresses the naturally arising question: how can we trust the evaluation results? The calibrator quantifies the performance of the evaluator by providing the ground truth against which the evaluator can be gauged. The reference dataset that forms the calibrator defines the expected behavior of the evaluator.
The samples of the calibration dataset are similar to those of the demonstration dataset, consisting of a score, an output, and an optional input and justification.
On the Calibrator page:
The calibration dataset can be imported from a file or typed in the editor.
A synthetic dataset can be generated, edited, and appended.
The Calibration page enables comparing the actual performance of the evaluator on the samples of the calibrator dataset with the expected performance defined by the scores of the set.
On this page:
Total deviance, which quantifies the average magnitude of the errors between the values predicted by the evaluator and the actual observed values, can be calculated. A lower total deviance indicates that the evaluator's predictions are closer to the actual outcomes, which signifies better performance of the evaluator. The total deviance is computed using the Root Mean Square method (see the sketch after this list).
Deviations for individual samples of the dataset are displayed, enabling easy identification of weak points of the evaluator. If a particular sample has a high deviation, there are characteristics in the sample that confuse the evaluator.
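The calculation behind the total deviance can be sketched as follows (illustrative Python, not product code):

```python
import math

def total_deviance(expected_scores, predicted_scores):
    """Root Mean Square deviation between the ground truth scores of the
    calibration dataset and the scores the evaluator actually produced."""
    squared_errors = [(e - p) ** 2 for e, p in zip(expected_scores, predicted_scores)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Three calibration samples: expected scores vs. scores given by the evaluator.
print(total_deviance([1.0, 0.2, 0.8], [0.9, 0.4, 0.7]))  # approximately 0.14
```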
How to improve the performance of an evaluator
To improve the performance or 'calibrate' an evaluator, adjustments can be made to one or more of the three key components: the prompt, the demonstrations, and the model.
Effective strategies for this can be deduced by examining the calibration results. Inspecting the worst-performing samples, those with the largest deviations, can help identify the evaluator's weak points.
Then, one or more steps can be taken:
The instructions given in the prompt can be made more specific to adjust the behavior in the problem cases.
The demonstration content can be modified by adding examples similar to the problematic samples, which can enhance performance in these areas. Additional guidance can be given by adding a justification to a demonstration. Note: maintaining a sufficiently large calibration dataset reduces the risk of overfitting, i.e., producing an evaluator tailored to the calibration set but lacking generalization.
The model can be changed. Overall performance can be improved by using a larger or otherwise better suited model, often at the cost of evaluation latency and price.
After each modification, it's advisable to recalculate the deviations to assess the direction and magnitude of the impact on performance.
Evaluator skill permissions
As evaluators are a special type of skill, the permission controls that apply to all skills also apply to evaluator skills.
List of Evaluators Maintained by Root Signals
Evaluators tagged with RAG Evaluator work properly when evaluating skills with reference variables. Alternatively, when they are not used to evaluate skill outputs, a context parameter containing a set of documents as a list of strings, corresponding to the retrieved context data, must be passed.
Evaluators tagged with Ground Truth Evaluator can be used for evaluating test sets that contain an expected output column. When used through the SDK, an expected output parameter must likewise be passed.
Evaluators tagged with Function Call Evaluator can be used through the SDK and require a functions parameter conforming to the OpenAI-compatible tools parameter to be passed.
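As a rough sketch of how these parameters might be passed through the Python SDK (the import, client, and method names below are assumptions based on the description above, not verified signatures; consult the SDK reference for the exact interface):

```python
# Hypothetical sketch only: names and signatures are assumed, not verbatim SDK code.
from root import RootSignals  # assumed package name for the Root Signals Python SDK

client = RootSignals()  # assumed to pick up the API key from the environment

# RAG evaluator: the retrieved context is passed as a list of strings.
faithfulness = client.evaluators.run(
    "Faithfulness",
    request="What is the capital of France?",
    response="The capital of France is Paris.",
    contexts=["Paris is the capital and most populous city of France."],
)

# Ground truth evaluator: the expected output is passed alongside the response.
correctness = client.evaluators.run(
    "Answer Correctness",
    request="What is the capital of France?",
    response="The capital of France is Paris.",
    expected_output="Paris",
)

# Function call evaluator: OpenAI-compatible tool definitions are passed as functions.
completeness = client.evaluators.run(
    "JSON Property Completeness",
    response='{"city": "Paris"}',
    functions=[{"type": "function", "function": {
        "name": "report_city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }}],
)
```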
Relevance Assesses the relevance of the output in relation to the input by evaluating accuracy, completeness, adherence to the prompt, and logical consistency, to determine whether responses remain directly on-topic and informative.
Safety for Children Checks the appropriateness of content for young audiences, focusing on avoiding language or themes that could be harmful or unsuitable for children, thus promoting safety and age-appropriateness.
Sentiment Recognition Identifies the emotional tone of the output, determining whether it is predominantly positive, negative, or neutral, which helps in gauging the overall sentiment conveyed by the content.
Coherence Assesses whether the output is logically consistent and contextually relevant, checking for smooth transitions between ideas and the maintenance of topic focus throughout the response.
Conciseness Measures the brevity and directness of the output, aiming to convey information efficiently without the use of unnecessary words, thus avoiding redundancy.
Engagingness Evaluates the ability of the output to engage readers, determining whether the content is interesting and likely to hold attention, or if it tends toward being boring or predictable.
Originality Checks the originality and creativity of the output, focusing on providing unique and inventive perspectives rather than relying on clichéd or derivative ideas.
Clarity Measures how easily the output can be understood on a first reading, aiming for ideas to be presented in a logical, straightforward manner with sufficient detail for clarity.
Precision Assesses the accuracy and specificity of the output, aiming for responses that are detailed and exact, with no room for ambiguities or overly general statements.
Persuasiveness Evaluates the persuasiveness of the output by examining the strength, clarity, and convincingness of the main argument or point, including how well it is supported by evidence, examples, and logical reasoning.
Confidentiality Assesses the output for the presence of any personal details, private communications, sensitive business data, or any other information that should be kept confidential, focusing on maintaining privacy and security.
Harmlessness Assesses the harmlessness of the output by checking for the presence of hate speech, slurs, derogatory language, and any incitement or glorification of violence. It evaluates the overall potential of the content to cause harm or distress.
Formality Evaluates the formality of the output by considering factors such as word choice, sentence structure, tone, grammar, and overall style. This helps in matching the content to the expected level of formality for the context.
Politeness Assesses the politeness of the output by examining factors such as word choice, tone, phrasing, and the overall level of respect and courtesy demonstrated in the response.
Helpfulness Evaluates the helpfulness of the output by considering how useful, informative, and beneficial the text is to a reader seeking information. Helpful text provides clear, accurate, relevant, and comprehensive information to aid the reader's understanding and ability to take appropriate action.
Non-toxicity Assesses the non-toxicity of the output. Text that is benign and completely harmless receives high scores.
Faithfulness RAG Evaluator Measures the factual consistency of the generated answer with respect to the retrieved context. It determines whether the response accurately reflects the information provided in the context.
Answer Relevance Measures how relevant a response is with respect to the prompt. Completeness and conciseness of the response are considered.
Truthfulness RAG Evaluator Measures the factual consistency of the generated answer against the given context and general knowledge. It examines whether the output is faithful to the context and does not contradict the general knowledge of the model used.
Quality of Writing - Professional Measures the quality of writing as a piece of academic or other professional text. It evaluates the formality, correctness, and appropriateness of the writing style, aiming to match professional standards.
Quality of Writing - Creative Measures the quality of writing as a piece of creative text. It evaluates the creativity, expressiveness, and originality of the content, focusing on its impact and artistic expression.
JSON Content Accuracy RAG Evaluator Function Call Evaluator Checks if the content of the JSON output is accurate and matches the documents and instructions, verifying that the JSON data correctly represents the intended information.
JSON Property Completeness Function Call Evaluator Checks how many of the required properties are present in the JSON output, verifying that all necessary fields are included. This is a string (non-LLM) evaluator.
JSON Property Type Accuracy Function Call Evaluator Checks if the types of properties in the JSON output match the expected types, verifying that the data types are correct and consistent. This is a string (non-LLM) evaluator.
JSON Property Name Accuracy Function Call Evaluator Checks if the names of properties in the JSON output match the expected names, verifying that the field names are correct and standardized. This is a string (non-LLM) evaluator.
JSON Empty Values Ratio Function Call Evaluator Checks the portion of empty values in the JSON output, aiming to minimize missing information and ensure data completeness. This is a string (non-LLM) evaluator.
Answer Semantic Similarity Ground Truth Evaluator Measures the semantic similarity between the generated answer and the ground truth, helping to evaluate how well the response mirrors the expected output.
Answer Correctness Ground Truth Evaluator Measures the factual correspondence of the generated response against a user-supplied ground truth. It considers both semantic similarity and factual consistency.
Context Recall RAG Evaluator Ground Truth Evaluator Measures whether the retrieved context provides sufficient information to produce the ground truth response, evaluating if the context is relevant and comprehensive according to the expected output.