Evaluators
An evaluator is a metric for a piece of text: it maps a string originating from a language model to a numeric value between 0 and 1. For example, an evaluator could measure the "Truthfulness" of the generated text. When coupled with a threshold value, an evaluator can also serve as a validator for non-evaluator skills.
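As a minimal illustration of the thresholding idea (plain Python, not part of the Root Signals product):

```python
from typing import Callable

# An evaluator maps a piece of text to a score in [0, 1].
Evaluator = Callable[[str], float]

def as_validator(evaluator: Evaluator, threshold: float) -> Callable[[str], bool]:
    """Turn an evaluator into a pass/fail validator by applying a threshold."""
    def validate(text: str) -> bool:
        return evaluator(text) >= threshold
    return validate

# Example: a (hypothetical) truthfulness evaluator used as a validator at 0.8.
# validate_truthfulness = as_validator(truthfulness_evaluator, threshold=0.8)
# validate_truthfulness("Paris is the capital of France.")  # True if score >= 0.8
```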
Root Signals provides a rich collection of pre-built evaluators that you can use, such as:
Quality of Professional Writing: Checks how grammatically correct, clear, concise, and precise the output is.
Toxicity Detection: Identifies any toxic or inappropriate content in the skill's output.
Faithfulness: Verifies the accuracy of information provided by the skill.
Sentiment Analysis: Determines the overall sentiment (positive, negative, or neutral) of the skill's output.
You can also define your own custom evaluators.
Evaluators are a special type of skill in Root Signals that assess the performance and quality of the outputs of operational skills. Custom evaluators are similar in structure to normal skills, consisting of a name, objective, and function.
Objective
The objective of an evaluator skill consists of two components:
Intent: This describes the purpose and goal of the evaluator skill, specifying what it aims to evaluate or assess in the outputs of other skills.
Calibrator: The calibrator serves a similar role to test data in operational skills. It provides the ground truth set of appropriate numeric values for specific input-output pairs that defines the intended behavior of the evaluator. This set 'calibrates' its evaluation criteria and ensures consistent and accurate assessments.
Function
The function of an evaluator skill consists of three components:
Prompt
Demonstrations
Model
Prompt
The prompt (or instruction) defines the instructions and variable content that the evaluator sends to a large language model. It should clearly specify the criteria and guidelines for assessing the quality and performance of skill outputs.
Note: During skill execution, the prompt defined by the user is appended to a more general template containing instructions responsible for guiding and optimizing the behavior of the evaluator. Thus the user does not have to bother with generic instructions such as "Give a score between 0 and 1"; it is sufficient to describe the evaluation criteria of the specific evaluator at hand.
Example: How well does the {{output}} adhere to instructions given in {{input}}.
Variables in an evaluator skill
All variable types are available for an evaluator skill. However, some restrictions apply.
The prompt of an evaluator must contain a special variable named output that represents the LLM output to be evaluated. It can also contain a special variable named input if the prompt that produced the output is considered relevant for evaluation. Both output and input can be either input or reference variables. In the latter case, the variable is associated with a dataset that can be searched for contextual information to support the evaluation, using Retrieval-Augmented Generation. In addition, a dataloader variable can be inserted for dynamic information retrieval.
Demonstrations
A demonstration is a sample consisting of an input-output pair (or just an output, if the input is not considered necessary for evaluation), an expected score, and an optional justification. Demonstrations exemplify the expected behavior of the evaluator skill. Demonstrations are provided to the model and must therefore be strictly separated from any evaluation or calibration data of the related AI skill.
A justification illustrates the rationale for the given score. Justification can be helpful when the reason for a specific score is not obvious, allowing the model to pay attention to relevant aspects of the evaluated response and tackle ambiguous cases in a nuanced way.
Example: a sample demonstration for an evaluator that determines whether content is safe for children.
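For illustration only (the texts and score below are hypothetical, not taken from the product), such a demonstration might look like this:

Input: Write a short introduction to photosynthesis for a third-grade science class.
Output: Plants are like tiny chefs! They use sunlight, water, and air to make their own food.
Score: 1.0
Justification: The content is friendly and educational, and contains no language or themes that could be harmful to children.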
Model
The model refers to the specific language model or engine used to execute the evaluator skill. It should be chosen based on its capabilities and suitability for the evaluation task.
Calibration
Unlike operational skills, evaluator skills do not have validators associated with them.
Calibration addresses the naturally arising question: how can we trust the evaluation results? The calibrator quantifies the performance of the evaluator by providing the ground truth against which the evaluator can be gauged. The reference dataset that forms the calibrator defines the expected behavior of the evaluator.
The samples of the calibration dataset are similar to those of the demonstration dataset, consisting of a score, an output, and an optional input and justification.
On the Calibrator page:
The calibration dataset can be imported from a file or typed in the editor.
A synthetic dataset can be generated, edited, and appended.
The Calibration page enables comparing the actual performance of the evaluator on the samples of the calibrator dataset with the expected performance defined by the scores of the set.
On this page:
Total deviance, which quantifies the average magnitude of the errors between the values predicted by the evaluator and the actual observed values, can be calculated. A lower total deviance indicates that the evaluator's predictions are closer to the actual outcomes, which signifies better performance of the evaluator. The total deviance is computed using the Root Mean Square method (see the sketch after this list).
Deviations for individual samples of the dataset are displayed, enabling easy identification of weak points of the evaluator. If a particular sample has a high deviation, there are characteristics in the sample that confuse the evaluator.
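The calculation behind the total deviance can be sketched as follows (illustrative Python, not product code):

```python
import math

def total_deviance(expected_scores, predicted_scores):
    """Root Mean Square deviation between the ground truth scores of the
    calibration dataset and the scores the evaluator actually produced."""
    squared_errors = [(e - p) ** 2 for e, p in zip(expected_scores, predicted_scores)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Three calibration samples: expected scores vs. scores given by the evaluator.
print(total_deviance([1.0, 0.2, 0.8], [0.9, 0.4, 0.7]))  # approximately 0.14
```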
How to improve the performance of an evaluator
To improve the performance or 'calibrate' an evaluator, adjustments can be made to one or more of the three key components: the prompt, the demonstrations, and the model.
Effective strategies for this can be deduced by examining the calibration results. Inspecting the worst-performing samples, those with the largest deviations, can help identify the evaluator's weak points.
Then, one or more steps can be taken:
The instructions given in the prompt can be made more specific to adjust the behavior in the problem cases.
The demonstration content can be modified by adding examples similar to the problematic samples, which can enhance performance in these areas. Additional guidance can be given by adding a justification to a demonstration. Note: maintaining a sufficiently large calibration dataset reduces the risk of overfitting, i.e., producing an evaluator tailored to the calibration set but lacking generalization.
The model can be changed. Overall performance can be improved by using a larger or otherwise better suited model, often at the cost of evaluation latency and price.
After each modification, it's advisable to recalculate the deviations to assess the direction and magnitude of the impact on performance.
Evaluator skill permissions
As evaluators are a special type of skill, the permission controls that apply to all skills also apply to evaluator skills.
List of Evaluators Maintained by Root Signals
Evaluators tagged with RAG Evaluator work properly when evaluating skills with reference variables. Alternatively, when they are not used to evaluate skill outputs, a context parameter containing a set of documents as a list of strings, corresponding to the retrieved context data, must be passed.
Evaluators tagged with Ground Truth Evaluator can be used for evaluating test sets that contain an expected output column. When used through the SDK, an expected output parameter must likewise be passed.
Evaluators tagged with Function Call Evaluator can be used through the SDK and require a functions parameter conforming to the OpenAI-compatible tools parameter to be passed.
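As a rough sketch of how these parameters might be passed through the Python SDK (the import, client, and method names below are assumptions based on the description above, not verified signatures; consult the SDK reference for the exact interface):

```python
# Hypothetical sketch only: names and signatures are assumed, not verbatim SDK code.
from root import RootSignals  # assumed package name for the Root Signals Python SDK

client = RootSignals()  # assumed to pick up the API key from the environment

# RAG evaluator: the retrieved context is passed as a list of strings.
faithfulness = client.evaluators.run(
    "Faithfulness",
    request="What is the capital of France?",
    response="The capital of France is Paris.",
    contexts=["Paris is the capital and most populous city of France."],
)

# Ground truth evaluator: the expected output is passed alongside the response.
correctness = client.evaluators.run(
    "Answer Correctness",
    request="What is the capital of France?",
    response="The capital of France is Paris.",
    expected_output="Paris",
)

# Function call evaluator: OpenAI-compatible tool definitions are passed as functions.
completeness = client.evaluators.run(
    "JSON Property Completeness",
    response='{"city": "Paris"}',
    functions=[{"type": "function", "function": {
        "name": "report_city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }}],
)
```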
Relevance Assesses the relevance of the output in relation to the input by evaluating accuracy, completeness, adherence to the prompt, and logical consistency, to determine whether responses remain directly on-topic and informative.
Safety for Children Checks the appropriateness of content for young audiences, focusing on avoiding language or themes that could be harmful or unsuitable for children, thus promoting safety and age-appropriateness.
Sentiment Recognition Identifies the emotional tone of the output, determining whether it is predominantly positive, negative, or neutral, which helps in gauging the overall sentiment conveyed by the content.
Coherence Assesses whether the output is logically consistent and contextually relevant, checking for smooth transitions between ideas and the maintenance of topic focus throughout the response.
Conciseness Measures the brevity and directness of the output, aiming to convey information efficiently without the use of unnecessary words, thus avoiding redundancy.
Engagingness Evaluates the ability of the output to engage readers, determining whether the content is interesting and likely to hold attention, or if it tends toward being boring or predictable.
Originality Checks the originality and creativity of the output, focusing on providing unique and inventive perspectives rather than relying on clichéd or derivative ideas.
Clarity Measures how easily the output can be understood on a first reading, aiming for ideas to be presented in a logical, straightforward manner with sufficient detail for clarity.
Precision Assesses the accuracy and specificity of the output, aiming for responses that are detailed and exact, with no room for ambiguities or overly general statements.
Persuasiveness Evaluates the persuasiveness of the output by examining the strength, clarity, and convincingness of the main argument or point, including how well it is supported by evidence, examples, and logical reasoning.
Confidentiality Assesses the output for the presence of any personal details, private communications, sensitive business data, or any other information that should be kept confidential, focusing on maintaining privacy and security.
Harmlessness Assesses the harmlessness of the output by checking for the presence of hate speech, slurs, derogatory language, and any incitement or glorification of violence. It evaluates the overall potential of the content to cause harm or distress.
Formality Evaluates the formality of the output by considering factors such as word choice, sentence structure, tone, grammar, and overall style. This helps in matching the content to the expected level of formality for the context.
Politeness Assesses the politeness of the output by examining factors such as word choice, tone, phrasing, and the overall level of respect and courtesy demonstrated in the response.
Helpfulness Evaluates the helpfulness of the output by considering how useful, informative, and beneficial the text is to a reader seeking information. Helpful text provides clear, accurate, relevant, and comprehensive information to aid the reader's understanding and ability to take appropriate action.
Non-toxicity Assesses the non-toxicity of the output. Text that is benign and completely harmless receives high scores.
Faithfulness RAG Evaluator Measures the factual consistency of the generated answer with respect to the retrieved context. It determines whether the response accurately reflects the information provided in the context.
Answer Relevance Measures how relevant a response is with respect to the prompt. Completeness and conciseness of the response are considered.
Truthfulness RAG Evaluator Measures the factual consistency of the generated answer against the given context and general knowledge. It examines whether the output is faithful to the context and does not contradict the general knowledge of the model used.
Quality of Writing - Professional Measures the quality of writing as a piece of academic or other professional text. It evaluates the formality, correctness, and appropriateness of the writing style, aiming to match professional standards.
Quality of Writing - Creative Measures the quality of writing as a piece of creative text. It evaluates the creativity, expressiveness, and originality of the content, focusing on its impact and artistic expression.
JSON Content Accuracy RAG Evaluator Function Call Evaluator Checks if the content of the JSON output is accurate and matches the documents and instructions, verifying that the JSON data correctly represents the intended information.
JSON Property Completeness Function Call Evaluator Checks how many of the required properties are present in the JSON output, verifying that all necessary fields are included. This is a string (non-LLM) evaluator.
JSON Property Type Accuracy Function Call Evaluator Checks if the types of properties in the JSON output match the expected types, verifying that the data types are correct and consistent. This is a string (non-LLM) evaluator.
JSON Property Name Accuracy Function Call Evaluator Checks if the names of properties in the JSON output match the expected names, verifying that the field names are correct and standardized. This is a string (non-LLM) evaluator.
JSON Empty Values Ratio Function Call Evaluator Checks the portion of empty values in the JSON output, aiming to minimize missing information and ensure data completeness. This is a string (non-LLM) evaluator.
Answer Semantic Similarity Ground Truth Evaluator Measures the semantic similarity between the generated answer and the ground truth, helping to evaluate how well the response mirrors the expected output.
Answer Correctness Ground Truth Evaluator Measures the factual correspondence of the generated response against a user-supplied ground truth. It considers both semantic similarity and factual consistency.
Context Recall RAG Evaluator Ground Truth Evaluator Measures whether the retrieved context provides sufficient information to produce the ground truth response, evaluating if the context is relevant and comprehensive according to the expected output.