Building production-ready and reliable AI applications requires the safeguards of an evaluation layer: LLM responses can vary drastically with even the slightest input changes.
Root Signals provides a robust set of fundamental evaluators suitable for any LLM-based application.
Setup
You need a few examples of LLM output (text). These can come from any source, such as a summarization output on a given topic.
Running an evaluator through the UI
The evaluators listing page shows all evaluators at your disposal. Root Signals provides the base evaluators, but you can also build custom evaluators for specific needs.
Let's start with the Precision evaluator; depending on the text you want to evaluate, feel free to try other evaluators as well.
Click on the Precision evaluator and then click on the Execute skill button.
Paste the text you want to evaluate into the output field and click Execute. You will get a numeric score for your text on the metric that the evaluator measures.
An individual score is not very interesting. The power of evaluation lies in integrating evaluators into an LLM application.
Integrating evaluators as part of existing AI automation
Integrating evaluators into your LLM application is a more systematic approach to evaluating LLM outputs: you can compare scores over time and take action based on the evaluation results (see the sketch after the SDK example below).
The Precision evaluator details page contains instructions for adding it to your application. First, you need a Root Signals API key; then you can execute the example cURL command.
1. Go to the Precision evaluator details page.
2. Click on the Add to your application link.
3. Copy the cURL command.
You can omit the request field from the data payload and put the text to evaluate in the response field.
Example (cURL)
curl 'https://api.app.rootsignals.ai/v1/skills/evaluator/execute/767bdd49-5f8c-48ca-8324-dfd6be7f8a79/' \
  -H 'authorization: Api-Key <YOUR API KEY>' \
  -H 'content-type: application/json' \
  --data-raw '{"response": "While large language models (LLMs) have many powerful applications, there are scenarios where they are not as effective or suitable. Here are some use cases where LLMs may not be useful:\n\nReal-Time Critical Systems:\nLLMs are not ideal for applications requiring real-time, critical decision-making, such as air traffic control, medical emergency systems, or autonomous vehicle navigation, where delays or errors can have severe consequences.\n\nHighly Specialized Expert Tasks:\nTasks that require deep domain-specific expertise, such as advanced scientific research, complex legal analysis, or detailed medical diagnosis, may be beyond the capabilities of LLMs due to the need for precise, highly specialized knowledge and judgment."}'
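A successful call returns the evaluation result as a JSON body, including the same numeric score you saw in the UI.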
Example (Python SDK)
# pip install root-signals
from root import RootSignals

client = RootSignals(api_key="<YOUR API KEY>")

client.evaluators.Precision(
    response="While large language models (LLMs) have many powerful applications, there are scenarios where they are not as effective or suitable. Here are some use cases where LLMs may not be useful:\n\nReal-Time Critical Systems:\nLLMs are not ideal for applications requiring real-time, critical decision-making, such as air traffic control, medical emergency systems, or autonomous vehicle navigation, where delays or errors can have severe consequences.\n\nHighly Specialized Expert Tasks:\nTasks that require deep domain-specific expertise, such as advanced scientific research, complex legal analysis, or detailed medical diagnosis, may be beyond the capabilities of LLMs due to the need for precise, highly specialized knowledge and judgment."
)
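Once the evaluator call works, you can wire it into your pipeline as a guardrail and act on the score. The sketch below is illustrative rather than a definitive integration: it assumes the result object returned by the SDK exposes a numeric score attribute, and generate_answer and the 0.7 threshold are placeholders for your own LLM call and quality bar.

# A minimal guardrail sketch: score each LLM answer and reject low-precision ones.
# Assumes the SDK result exposes a numeric `score` attribute; `generate_answer`
# and the 0.7 threshold are illustrative placeholders, not part of the product.
from root import RootSignals

client = RootSignals(api_key="<YOUR API KEY>")

def generate_answer(prompt: str) -> str:
    # Placeholder for your existing LLM call.
    return "..."

def answer_with_guardrail(prompt: str) -> str:
    answer = generate_answer(prompt)
    result = client.evaluators.Precision(response=answer)
    if result.score < 0.7:  # illustrative quality threshold
        # Log, retry, or fall back instead of returning a low-precision answer.
        raise ValueError(f"Precision score too low: {result.score:.2f}")
    return answer

Instead of raising an error, you could record each score alongside the request, which lets you compare results over time as described above.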