Frequently Asked Questions
Terminology
What is Intent for?
Intent is the high-level, human-understandable description of the attribute an Evaluator measures. For example: “To measure how clearly the returns handler explains the 20% discount offer on the next purchase”.
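For illustration, a minimal sketch of where the Intent lives when creating a Custom Evaluator with the Python SDK. The method and parameter names (`evaluators.create`, `intent`, `predicate`) are assumptions for this sketch, not a definitive API reference.

```python
# Sketch: the Intent is descriptive metadata attached at evaluator creation.
# SDK method and parameter names here are illustrative assumptions.
from root import RootSignals

client = RootSignals()  # assumes the API key is available in the environment

evaluator = client.evaluators.create(
    name="Discount clarity",
    # The Intent: a human-readable statement of what the evaluator measures.
    intent=(
        "To measure how clearly the returns handler explains the 20% "
        "discount offer on the next purchase"
    ),
    # The evaluation instruction itself; this, not the Intent, drives scoring.
    predicate="Does the following response clearly explain the discount offer? {{response}}",
    model="gpt-4o",
)
```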
What are Datasets?
Datasets let you bring your own test data, both for benchmarking evaluators (Root and Custom) and for optimizing Custom evaluators.
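As a rough sketch, assuming the Python SDK exposes a `datasets.create` method (the method and parameter names below are assumptions), bringing a dataset might look like this:

```python
# Sketch: uploading test data to use when benchmarking or optimizing evaluators.
# Method and parameter names are illustrative assumptions, not the exact API.
from root import RootSignals

client = RootSignals()

dataset = client.datasets.create(
    name="returns-handler-test-set",
    path="returns_handler_samples.csv",  # local file with your test cases
)
print(dataset.id)  # reference this dataset when benchmarking an evaluator
```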
Behaviour
Does Intent change the behaviour of the evaluator?
No. An evaluator's Intent is descriptive only; it does not alter the evaluator's behaviour.
Does Calibration change the behaviour of the evaluator?
No. Calibration is for benchmarking (testing) evaluators, to understand whether they are "calibrated" to your expected behaviour. Calibration samples do not alter the evaluators' behaviour.
How do Demonstrations work?
Demonstrations are used as in-context few-shot samples, combined with our well-tuned meta-prompt. They are not used for supervised fine-tuning (SFT).
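A minimal sketch of supplying Demonstrations when creating a Custom Evaluator. The `demonstrations` parameter and the sample fields are assumptions for illustration; each demonstration pairs an example output with the score you would assign it.

```python
# Sketch: Demonstrations are annotated samples injected as in-context few-shot
# examples; they are not used for fine-tuning. Names below are illustrative.
from root import RootSignals

client = RootSignals()

evaluator = client.evaluators.create(
    name="Discount clarity",
    predicate="Does the following response clearly explain the discount offer? {{response}}",
    model="gpt-4o",
    demonstrations=[
        # An output you consider good, with the score you would give it.
        {"response": "You get 20% off your next purchase, applied automatically at checkout.", "score": 0.9},
        # An output you consider poor, with a correspondingly low score.
        {"response": "There might be some discount later.", "score": 0.2},
    ],
)
```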
Usage
Our stack is not in Python, can we still use Root Signals?
Absolutely. We provide a REST API that you can call from your favourite tech stack.
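The sketch below is shown in Python only to keep the examples in one language; the same HTTP request can be made from any stack. The endpoint path and payload fields are placeholders, not the exact routes; see the REST API reference for the real ones.

```python
# Illustrative shape of an evaluator execution over plain HTTP.
# The URL path and JSON fields are placeholders, not the documented API.
import os
import requests

resp = requests.post(
    "https://api.app.rootsignals.ai/v1/evaluators/<evaluator_id>/execute/",  # placeholder
    headers={"Authorization": f"Api-Key {os.environ['ROOTSIGNALS_API_KEY']}"},
    json={"response": "You get 20% off your next purchase."},
    timeout=30,
)
print(resp.json())  # structured result: score, justification, etc.
```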
Do I need to have Calibrations for all Custom Evaluators?
You do not have to bring Calibration samples, but we strongly recommend at least a handful of them so that you can verify how your evaluators behave.
Can I change the behaviour of the evaluator by bringing labeled data?
You can change the behaviour of your Custom Evaluators by bringing annotated samples as Demonstrations. The behaviour of Root Evaluators cannot be altered.
If we already have a ground truth expected output, can we use your evaluators?
Yes. Several of our evaluators support reference-based evaluation, where you provide your ground-truth expected responses. See our evaluator catalogue here.
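A minimal sketch of a reference-based run, assuming the run call accepts an expected-output field alongside the response (the method and parameter names are assumptions):

```python
# Sketch: reference-based evaluation with a ground-truth expected answer.
# Method and parameter names are illustrative assumptions.
from root import RootSignals

client = RootSignals()

result = client.evaluators.run(
    evaluator_id="<your-evaluator-id>",
    response="The return window is 30 days from delivery.",
    expected_output="Returns are accepted within 30 days of delivery.",
)
print(result.score)
```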
How can I differentiate evaluations and related statistics for different applications (or versions) of mine?
You can attach arbitrary tags to evaluation executions, which lets you separate results per application or version. See the example here.
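For illustration, a sketch of tagging an execution so results can later be separated per application or version (the `tags` parameter name is an assumption):

```python
# Sketch: attach free-form tags to an evaluation execution.
# The `tags` parameter name is an illustrative assumption.
from root import RootSignals

client = RootSignals()

result = client.evaluators.run(
    evaluator_id="<your-evaluator-id>",
    response="You get 20% off your next purchase.",
    tags=["returns-handler", "v2.3"],  # e.g. application name and version
)
```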
Can I integrate Root Signals evaluators to experiment tracking tools such as MLflow etc.?
Yes. Our evaluators return a structured response (e.g. a dictionary) with scores, justifications, tags, etc. These results can be logged to any experiment tracking system or database, just like any other metric, metadata, or attribute.
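For example, a sketch of logging a result to MLflow. The MLflow calls are standard tracking APIs; the result field names (`score`, `justification`) are assumptions about the structured response.

```python
# Sketch: push an evaluator result into MLflow like any other metric.
# Result field names are assumptions; mlflow calls are standard tracking APIs.
import mlflow
from root import RootSignals

client = RootSignals()

result = client.evaluators.run(
    evaluator_id="<your-evaluator-id>",
    response="You get 20% off your next purchase.",
)

with mlflow.start_run(run_name="returns-handler-v2.3"):
    mlflow.log_metric("discount_clarity_score", result.score)
    mlflow.log_text(result.justification, "discount_clarity_justification.txt")
```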
Models