Principles
The Root Signals platform is built upon foundational principles that ensure semantic rigor, measurement accuracy, and operational flexibility. These principles guide the design and implementation of all platform features, from evaluator creation to production deployment.
1. Separation of Concerns: Objectives and Implementations
At the core of Root Signals lies a fundamental distinction between what should be measured and how it is measured:
An Objective defines the precise semantic criteria and measurement scale for evaluation.
An Evaluator represents an implementation that can meet these criteria.
This separation enables:
Multiple evaluator implementations for the same objective
Evolution of measurement techniques without changing business requirements
Clear communication between stakeholders about evaluation goals
Standardized benchmarking across different implementations
In practice, an objective consists of an Intent (describing the purpose and goal) and a Calibrator (the score-annotated dataset providing ground truth examples). The evaluator's function, composed of a prompt, demonstrations, and a model, represents just one possible implementation of that objective.
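As a rough sketch of this split, the example below models an objective and an evaluator as separate structures. All class and field names here (CalibrationSample, Objective, Evaluator, and the example model identifier) are illustrative assumptions, not the Root Signals SDK:

```python
from dataclasses import dataclass, field

# Illustrative structures only; the names do not come from the Root Signals SDK.

@dataclass
class CalibrationSample:
    """A ground-truth example: input text, expected score, optional rationale."""
    text: str
    expected_score: float              # normalized to [0, 1]
    justification: str | None = None


@dataclass
class Objective:
    """WHAT to measure: intent plus a score-annotated calibration set."""
    intent: str                                        # purpose and goal of the measurement
    calibrator: list[CalibrationSample] = field(default_factory=list)


@dataclass
class Evaluator:
    """HOW to measure: one possible implementation of an objective."""
    objective: Objective
    prompt: str                                        # judge prompt template
    demonstrations: list[CalibrationSample] = field(default_factory=list)
    model: str = "gpt-4o"                              # judge model identifier (example value)
```

Two Evaluator instances with different prompts, demonstrations, or judge models can implement the same Objective, which is what makes benchmarking across implementations meaningful.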
2. Calibration and Measurement Accuracy
Every measurement instrument requires calibration against known standards. In Root Signals, evaluators undergo rigorous calibration to ensure their scores align with human judgment baselines. This process involves:
Calibration datasets: Ground truth examples with expected scores, including optional justifications that illustrate the rationale for specific scores
Deviation analysis: Quantitative assessment of total deviation between predicted and expected scores, using the root mean square (RMS) of the per-sample differences
Continuous refinement: Iterative improvement based on calibration results, focusing on samples with highest deviation
Version control: Tracking evaluator performance across iterations
Production feedback loops: Adding real execution samples to calibration sets for ongoing improvement
The calibration principle acknowledges that LLM-based evaluators are probabilistic instruments requiring empirical validation. Calibration samples must be strictly separated from demonstration samples to ensure unbiased measurement.
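A minimal sketch of the deviation-analysis step, assuming the calibration set provides expected scores and the evaluator under test has produced predicted scores on the same [0, 1] scale:

```python
import math

def rms_deviation(expected: list[float], predicted: list[float]) -> float:
    """Root-mean-square deviation between expected (ground-truth) and predicted scores."""
    assert expected and len(expected) == len(predicted), "score lists must be non-empty and aligned"
    return math.sqrt(sum((e - p) ** 2 for e, p in zip(expected, predicted)) / len(expected))

def worst_samples(expected: list[float], predicted: list[float], k: int = 3) -> list[int]:
    """Indices of the k samples with the largest absolute deviation, i.e. the first refinement targets."""
    deviations = [abs(e - p) for e, p in zip(expected, predicted)]
    return sorted(range(len(deviations)), key=deviations.__getitem__, reverse=True)[:k]

# Example with made-up scores: expected values come from the calibration set,
# predicted values from running the evaluator on the same inputs.
expected = [0.9, 0.2, 0.7, 0.5]
predicted = [0.8, 0.4, 0.7, 0.1]
print(round(rms_deviation(expected, predicted), 3))   # 0.229
print(worst_samples(expected, predicted))             # [3, 1, 0]
```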
3. Metric-First Architecture
All evaluations in Root Signals are fundamentally metric evaluations, producing normalized scores between 0 and 1. This universal approach provides:
Generalizability: Any evaluation concept can be expressed as a continuous metric
Optimization capability: Numeric scores enable gradient-based optimization
Fuzzy semantics handling: Real-world concepts exist on spectrums rather than binary states
Composability: Metrics can be combined, weighted, and aggregated
This principle recognizes that language and meaning are inherently fuzzy, requiring nuanced measurement approaches. Every evaluator maps text to a numeric value, enabling consistent measurement across diverse dimensions like coherence (logical consistency), conciseness (brevity without information loss), or harmlessness (absence of harmful content).
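To make the metric-first view concrete, the sketch below treats every evaluator as a function from text to a float in [0, 1] and shows one way such scores can be weighted and aggregated. The toy conciseness heuristic and the placeholder harmlessness lambda are illustrative stand-ins, not how Root Signals evaluators are implemented:

```python
from typing import Callable

# In the metric-first view, an evaluator is simply: text -> score in [0, 1].
Evaluator = Callable[[str], float]

def conciseness(text: str, target_words: int = 50) -> float:
    """Toy stand-in: the score drops as the text exceeds a target length.
    A real evaluator would judge brevity without information loss."""
    words = len(text.split())
    return min(1.0, target_words / max(words, 1))

def weighted_score(text: str, metrics: dict[str, tuple[Evaluator, float]]) -> float:
    """Combine several normalized metrics into one weighted aggregate, still in [0, 1]."""
    total_weight = sum(weight for _, weight in metrics.values())
    return sum(weight * evaluator(text) for evaluator, weight in metrics.values()) / total_weight

metrics = {
    "conciseness":  (conciseness, 0.4),
    "harmlessness": (lambda text: 1.0, 0.6),   # placeholder: a real evaluator would judge content
}
print(weighted_score("A short, harmless reply.", metrics))   # 1.0 for this toy input
```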
4. Model Agnosticism and EvalOps
The platform maintains strict independence from specific model implementations, both for operational models (those being evaluated) and judge models (those performing evaluation). This enables:
Model comparison: Evaluate multiple models using identical criteria
Performance optimization: Select models based on accuracy, cost, and latency trade-offs
Future-proofing: Integrate new models as they become available
Vendor independence: Avoid lock-in to specific model providers
Changes in either operational or judge models can be measured precisely, enabling data-driven model selection. The platform supports API-based models (OpenAI, Anthropic), open-source models (Llama, Mistral), and custom locally-running models. Organization administrators control model availability, ensuring governance while maintaining flexibility.
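One way to read "changes in either operational or judge models can be measured precisely" is to re-run the same calibration set under each candidate judge model and compare the resulting deviations. The sketch below assumes a hypothetical run_evaluator(text, model) callable that executes an evaluator implementation with a given judge model; it is not a Root Signals API:

```python
import math

def rms(expected: list[float], predicted: list[float]) -> float:
    """Root-mean-square deviation between expected and predicted scores."""
    return math.sqrt(sum((e - p) ** 2 for e, p in zip(expected, predicted)) / len(expected))

def compare_judge_models(calibration, run_evaluator, candidate_models):
    """Score the same calibration set with each candidate judge model.

    calibration:      list of (input_text, expected_score) pairs
    run_evaluator:    hypothetical callable (text, model) -> score in [0, 1]
    candidate_models: judge model identifiers to compare
    Returns a mapping of model -> RMS deviation; lower means closer to the human baseline.
    """
    expected = [score for _, score in calibration]
    return {
        model: rms(expected, [run_evaluator(text, model) for text, _ in calibration])
        for model in candidate_models
    }
```

The same idea applies to operational models: hold the evaluator fixed and compare candidate operational models on the evaluation scores they receive.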
5. Interoperability and Portability
Evaluation definitions must transcend platform boundaries through standardized, interchangeable formats. This principle ensures:
Clear entity references: Distinguish between evaluator references and definitions
Objective portability: Move evaluation criteria between systems
Implementation flexibility: Express objectives independent of specific implementations
Semantic preservation: Maintain meaning across different contexts
The distinction between referencing an entity and describing it enables robust system integration.
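The sketch below illustrates the reference/definition distinction with JSON-style structures expressed as Python dictionaries. The field names and identifier are illustrative, not a Root Signals wire format:

```python
# A *reference* points at an evaluator that already exists in some system.
# It carries identity, not meaning, so it is only useful where that system is reachable.
evaluator_reference = {
    "evaluator_id": "c0ffee00-0000-0000-0000-000000000000",   # made-up identifier
}

# A *definition* carries the objective itself (intent plus calibration examples),
# so the evaluation criteria can be moved between systems and re-implemented there.
objective_definition = {
    "intent": "Measure whether a support reply fully answers the customer's question.",
    "scale": {"min": 0.0, "max": 1.0},
    "calibration": [
        {
            "input": "Q: How do I reset my password? A: Click 'Forgot password' on the login page.",
            "expected_score": 0.9,
            "justification": "Direct, actionable answer to the question asked.",
        },
        {
            "input": "Q: How do I reset my password? A: Our app has many great features.",
            "expected_score": 0.1,
            "justification": "Does not address the question.",
        },
    ],
}
```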
6. Dimensional Decomposition
Complex evaluation predicates can be expressed either as a single (inherently composite) evaluator or decomposed into a vector of multiple independent evaluators, each indicating a distinct dimension of measurement. This principle provides:
Granular calibration: Each dimension can be independently calibrated
Modular development: Evaluators can be developed and tested separately
Precise diagnostics: Identify which specific dimensions need improvement
Flexible composition: Combine dimensions based on use case requirements
For example, "helpfulness" might decompose into truthfulness, relevance, completeness, and clarity—each with its own evaluator and calibration set. This decomposition extends to specialized domains: RAG evaluators (faithfulness, context recall), structured output evaluators (JSON accuracy, property completeness), and task-specific evaluators (summarization quality, translation accuracy), etc. Judges represent practical implementations of this principle, stacking multiple evaluators to achieve comprehensive assessment.
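A hedged sketch of how such a decomposition might be assembled into a judge that reports one score per dimension, so weak dimensions remain visible instead of being averaged away. The dimension functions return fixed values only to keep the example runnable; in practice each would be a separately calibrated evaluator:

```python
# Placeholder dimension evaluators; each stands in for a calibrated evaluator
# that would normally call a judge model. Fixed returns keep the sketch runnable.
def truthfulness(response: str, context: str) -> float: return 0.9
def relevance(response: str, context: str) -> float:    return 0.8
def completeness(response: str, context: str) -> float: return 0.6
def clarity(response: str, context: str) -> float:      return 0.95

HELPFULNESS_DIMENSIONS = {
    "truthfulness": truthfulness,
    "relevance": relevance,
    "completeness": completeness,
    "clarity": clarity,
}

def judge_helpfulness(response: str, context: str) -> dict[str, float]:
    """Score every dimension independently; here the diagnostic is immediate:
    completeness is the dimension that needs improvement."""
    return {name: fn(response, context) for name, fn in HELPFULNESS_DIMENSIONS.items()}

print(judge_helpfulness("example response", "example context"))
# {'truthfulness': 0.9, 'relevance': 0.8, 'completeness': 0.6, 'clarity': 0.95}
```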
7. Operational Objectives
Similar to evaluation objectives, an operational task should have an objective that defines its success criteria independent of implementation. An operational objective consists of:
Intent: The business purpose of the operation.
Success criteria: The set of evaluators that together define acceptable outcomes and what good looks like.
Implementation independence: Multiple ways to achieve the objective.
In practice, the set of evaluators can be captured in a judge, while the intent is captured in the judge's intent description.
This principle extends the objective/implementation separation to operational workflows, enabling outcome-based task definition rather than prescriptive implementation.
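As a sketch of outcome-based task definition, an operational objective might pair an intent with a set of evaluator thresholds that any implementation must satisfy. The class, the toy metrics, and the thresholds below are illustrative assumptions:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class OperationalObjective:
    """What a successful operation looks like, independent of how it is implemented."""
    intent: str
    # name -> (evaluator returning a score in [0, 1], minimum acceptable score)
    success_criteria: dict[str, tuple[Callable[[str], float], float]]

    def is_met(self, output: str) -> bool:
        """An implementation succeeds if every evaluator clears its threshold."""
        return all(evaluator(output) >= threshold
                   for evaluator, threshold in self.success_criteria.values())

# Example: any summarization pipeline is acceptable if its outputs clear these bars.
objective = OperationalObjective(
    intent="Summarize incoming support tickets for the on-call engineer.",
    success_criteria={
        "conciseness":  (lambda text: min(1.0, 60 / max(len(text.split()), 1)), 0.7),  # toy metric
        "harmlessness": (lambda text: 1.0, 0.9),                                       # placeholder
    },
)
print(objective.is_met("Ticket summary: the customer cannot log in after the 2.3 update."))  # True
```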
8. Orthogonality of the Root Evaluator Stack
The Root Evaluators are designed as a set of primitive, orthogonal measurement dimensions that minimize overlap while maximizing coverage. This principle ensures:
Minimal redundancy: Each evaluator measures a distinct semantic dimension
Maximal composability: Evaluators combine cleanly without interference
Complete coverage: The primitive set spans the space of common evaluation needs
Predictable composition: Combining evaluators yields intuitive results
This orthogonality enables judges to be constructed as precise combinations of primitive evaluators. For instance, "professional communication quality" might combine:
Clarity (information structure)
Formality (tone appropriateness)
Precision (technical accuracy)
Grammar correctness (linguistic quality)
Each dimension contributes independently, allowing fine-grained control over the composite evaluation. The orthogonal design prevents double-counting of features and ensures that improving one dimension doesn't inadvertently degrade another. Where an evaluator could reasonably be interpreted in several different ways, we split the interpretations into separate objectives and corresponding Root Evaluators. Relevance is one such case: it may or may not be interpreted to include truthfulness. In a factual context, an untrue statement is arguably irrelevant, whereas in a story or hypothetical context, this may not be the case.
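One empirical way to check that two evaluators measure distinct dimensions is to score the same set of outputs with both and inspect how strongly the two score vectors correlate. This is an illustrative diagnostic with made-up numbers, not a Root Signals feature:

```python
from statistics import correlation  # Pearson correlation coefficient, Python 3.10+

# Scores two evaluators produced on the same four sample outputs (made-up numbers).
clarity_scores   = [0.2, 0.4, 0.6, 0.8]
formality_scores = [0.6, 0.2, 0.8, 0.4]

r = correlation(clarity_scores, formality_scores)
print(f"Pearson r = {r:.2f}")   # 0.00 for this data
# |r| near 0 suggests the dimensions vary independently (orthogonal);
# |r| near 1 suggests overlap, i.e. a composite judge would double-count a feature.
```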
Practical Implications
These principles manifest throughout the Root Signals platform:
Evaluator creation starts with objective definition before implementation
Calibration workflows ensure measurement reliability
Judge composition allows stacking evaluators for complex assessments
Version control tracks both objectives and implementations
API design separates concerns between what and how
By adhering to these principles, Root Signals provides a semantically rigorous foundation for AI evaluation that scales from simple metrics to complex operational workflows.