Lifecycle Management
In Root Signals, evaluation is treated as a procedure that computes a metric grounded in human-defined criteria, emphasizing the separation of utility grounding (the Objective) from implementation (the Evaluator function).
This lets the criteria and the implementations of the evaluations evolve on two separate, controlled, and trackable tracks, each with its own version-control logic.
Metric evaluators differ from other entities in this domain: treating them simply as "grounded in data", on one hand, or as "tests", on the other, misses some of their core properties.
In Root Signals, an Objective consists of:
Intent that is human-defined and human-understandable, corresponding to the precise attribute being measured.
Calibration dataset that defines, via examples, the structure and scale of the criteria (see the sketch after this list).
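As an illustration only (these dataclass names are hypothetical, not the Root Signals SDK), an Objective can be pictured as an immutable pairing of intent and calibration examples:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CalibrationExample:
    """One human-scored example anchoring the structure and scale of the criteria."""
    content: str           # the artifact being judged
    expected_score: float  # human-assigned ground truth, e.g. on a 0..1 scale


@dataclass(frozen=True)
class Objective:
    """Human-defined grounding: what is measured, and against which examples."""
    intent: str                                       # precise, human-understandable attribute
    calibration: tuple[CalibrationExample, ...] = ()  # calibration dataset
```

The frozen dataclasses mirror the lifecycle rule discussed below: a published Objective is not mutated in place.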
An Evaluator function consists of:
Predicate that uniquely specifies the task to the LLMs that power the evaluator
LLM
In-context examples (demonstrations)
[Optionally] Associated data files
An Evaluator function is typically associated with an Objective that connects it to business or contextual value, but the two have no causal connection, as sketched below.
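Continuing the same hypothetical sketch, an Evaluator function bundles its implementation details and carries at most a loose reference to an Objective:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvaluatorFunction:
    """One concrete, versionable implementation that targets an Objective."""
    predicate: str                      # uniquely specifies the task to the powering LLM
    llm: str                            # identifier of a supported model
    demonstrations: list[str] = field(default_factory=list)  # in-context examples
    data_files: list[str] = field(default_factory=list)      # optional associated data files
    objective_id: Optional[str] = None  # association only; no causal connection
```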
The Root Signals platform itself handles:
Semantic quantization: guaranteeing that predicates are consistently mapped to metrics (for supported LLMs). This lets us abstract the predicates out of the boilerplate prompts needed to yield robust metrics (illustrated after this list).
Version control of evaluator implementations
Maintenance of relationships between Objectives and Evaluators
Monitoring
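To make semantic quantization concrete, the following sketch shows the general idea: a bare predicate is wrapped in boilerplate (this template and helper are illustrative, not the platform's actual prompt or code) and the model's free-form reply is clamped onto a fixed metric scale:

```python
# Hypothetical boilerplate around a bare predicate; not the platform's template.
PROMPT_TEMPLATE = """You are an impartial evaluator.

Criterion: {predicate}

Content to evaluate:
{content}

Respond with a single integer from 0 (criterion not met) to 10 (fully met)."""


def quantize(raw_answer: str) -> float:
    """Map the model's reply onto a consistent 0..1 metric scale."""
    digits = [int(token) for token in raw_answer.split() if token.isdigit()]
    if not digits:
        raise ValueError(f"unmappable answer: {raw_answer!r}")
    return min(max(digits[0], 0), 10) / 10.0


prompt = PROMPT_TEMPLATE.format(
    predicate="The response is polite.",
    content="Thanks for reaching out! Happy to help.",
)
# quantize("Score: 9") == 0.9
```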
For example, if an Objective is changed (e.g. its calibration dataset is altered), it is not a priori clear whether the related criteria have changed; any such change affects all evaluator variants using the Objective, rendering earlier measurements backwards-incompatible. Hence, the best practice enforced by the Root Signals platform is to create an entirely new Objective, so that it is clear the criteria have changed. This can be bypassed, however, while the Objective is still in its formation stage and/or you accept that the criteria will change over time.
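The best practice can be pictured with the hypothetical `Objective` type from the earlier sketch: revising the calibration data produces a successor Objective rather than mutating the published one:

```python
from dataclasses import replace


def revise_objective(
    old: Objective, new_calibration: tuple[CalibrationExample, ...]
) -> Objective:
    """Create a successor Objective instead of mutating the published one.

    A fresh object makes the criteria change explicit, so measurements taken
    against the old Objective are not silently rendered backwards-incompatible.
    """
    return replace(old, calibration=new_calibration)
```

A fuller version might also record the predecessor's identifier, keeping the branching lineage auditable.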
Over time, improved evaluator functions will be created (through model updates, among other means) to improve upon the Objective's targets. Objectives, on the other hand, tend to branch and become more precise over time, passing the burden of resolving the question "is this still the same Objective?" to the users, while the platform provides software support to make those calls either way in an auditable and controllable manner.