Principles

The Root Signals platform is built upon foundational principles that ensure semantic rigor, measurement accuracy, and operational flexibility. These principles guide the design and implementation of all platform features, from evaluator creation to production deployment.

1. Separation of Concerns: Objectives and Implementations

At the core of Root Signals lies a fundamental distinction between what should be measured and how it is measured:

  • An Objective defines the precise semantic criteria and measurement scale for evaluation.

  • An Evaluator represents an implementation that can meet these criteria.

This separation enables:

  • Multiple evaluator implementations for the same objective

  • Evolution of measurement techniques without changing business requirements

  • Clear communication between stakeholders about evaluation goals

  • Standardized benchmarking across different implementations

In practice, an objective consists of an Intent (describing the purpose and goal) and a Calibrator (the score-annotated dataset providing ground truth examples). The evaluator's function, comprising a prompt, demonstrations, and a model, represents just one possible implementation of that objective.
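
As an illustrative sketch (the class and field names below are assumptions, not the platform's actual schema), the separation can be pictured as two data structures: an Objective that pairs an Intent with a Calibrator, and an Evaluator that binds one implementation to it:

```python
from dataclasses import dataclass

# Hypothetical structures illustrating the objective/implementation split;
# names and fields are assumptions, not the Root Signals schema.

@dataclass
class CalibrationSample:
    text: str                # input to be evaluated
    expected_score: float    # human-annotated ground truth in [0, 1]
    justification: str = ""  # optional rationale for the score

@dataclass
class Objective:
    intent: str                          # what should be measured, and why
    calibrator: list[CalibrationSample]  # score-annotated ground truth

@dataclass
class Evaluator:
    objective: Objective       # the criteria this implementation targets
    prompt: str                # judge prompt template
    demonstrations: list[str]  # few-shot examples, kept apart from calibration data
    model: str                 # judge model identifier
```

Because several Evaluator instances can point at the same Objective, implementations can be swapped, compared, or benchmarked without touching the business-level criteria.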

2. Calibration and Measurement Accuracy

Every measurement instrument requires calibration against known standards. In Root Signals, evaluators undergo rigorous calibration to ensure their scores align with human judgment baselines. This process involves:

  • Calibration datasets: Ground truth examples with expected scores, including optional justifications that illustrate the rationale for specific scores

  • Deviation analysis: Quantitative assessment using root mean square error (RMSE) to measure the total deviation between predicted and expected scores, as sketched below

  • Continuous refinement: Iterative improvement based on calibration results, focusing on samples with highest deviation

  • Version control: Tracking evaluator performance across iterations

  • Production feedback loops: Adding real execution samples to calibration sets for ongoing improvement

The calibration principle acknowledges that LLM-based evaluators are probabilistic instruments requiring empirical validation. Calibration samples must be strictly separated from demonstration samples, much as a test set is held out from training data, to ensure unbiased measurement.
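
A minimal sketch of the deviation analysis, assuming both predicted and expected scores are normalized to [0, 1]; evaluate_fn is a stand-in for whatever produces the evaluator's score:

```python
import math

def calibration_rmse(samples, evaluate_fn):
    """Root mean square error between predicted and expected scores.

    samples: iterable of (text, expected_score) pairs with scores in [0, 1].
    evaluate_fn: hypothetical callable mapping text to a predicted score.
    """
    errors = [(evaluate_fn(text) - expected) ** 2 for text, expected in samples]
    return math.sqrt(sum(errors) / len(errors))

def worst_samples(samples, evaluate_fn, k=5):
    """The k samples with the highest absolute deviation: the natural
    targets for the iterative refinement step described above."""
    deviations = [
        (abs(evaluate_fn(text) - expected), text, expected)
        for text, expected in samples
    ]
    return sorted(deviations, reverse=True)[:k]
```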

3. Metric-First Architecture

All evaluations in Root Signals are fundamentally metric evaluations, producing normalized scores between 0 and 1. This universal approach provides:

  • Generalizability: Any evaluation concept can be expressed as a continuous metric

  • Optimization capability: Numeric scores enable gradient-based optimization

  • Fuzzy semantics handling: Real-world concepts exist on a spectrum rather than as binary states

  • Composability: Metrics can be combined, weighted, and aggregated

This principle recognizes that language and meaning are inherently fuzzy, requiring nuanced measurement approaches. Every evaluator maps text to a numeric value, enabling consistent measurement across diverse dimensions like coherence (logical consistency), conciseness (brevity without information loss), or harmlessness (absence of harmful content).
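
Because every score lives on the same [0, 1] scale, composition reduces to simple arithmetic. A sketch of weighted aggregation, where the metric names and weights are purely illustrative:

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine normalized [0, 1] metric scores into a single composite score."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total_weight

# Example with three hypothetical dimensions, all already normalized to [0, 1].
scores = {"coherence": 0.82, "conciseness": 0.64, "harmlessness": 0.97}
weights = {"coherence": 0.5, "conciseness": 0.2, "harmlessness": 0.3}
print(weighted_score(scores, weights))  # ≈ 0.829
```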

4. Model Agnosticism and EvalOps

The platform maintains strict independence from specific model implementations, both for operational models (those being evaluated) and judge models (those performing evaluation). This enables:

  • Model comparison: Evaluate multiple models using identical criteria

  • Performance optimization: Select models based on accuracy, cost, and latency trade-offs

  • Future-proofing: Integrate new models as they become available

  • Vendor independence: Avoid lock-in to specific model providers

Changes in either operational or judge models can be measured precisely, enabling data-driven model selection. The platform supports API-based models (OpenAI, Anthropic), open-source models (Llama, Mistral), and custom locally-running models. Organization administrators control model availability, ensuring governance while maintaining flexibility.
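
Model selection then becomes an empirical question: run the same calibration set through each candidate judge model and compare the resulting deviation (alongside cost and latency, if tracked). A sketch reusing the calibration_rmse helper from the calibration section; make_evaluator is a hypothetical factory that binds a model to an evaluator:

```python
def pick_judge_model(candidate_models, samples, make_evaluator):
    """Rank candidate judge models by calibration RMSE on identical criteria.

    make_evaluator: hypothetical factory mapping a model name to a
    text -> score callable; samples are (text, expected_score) pairs.
    """
    results = {
        model: calibration_rmse(samples, make_evaluator(model))
        for model in candidate_models
    }
    # The lowest RMSE means the closest agreement with the human baseline.
    best = min(results, key=results.get)
    return best, results
```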

5. Interoperability and Portability

Evaluation definitions must transcend platform boundaries through standardized, interchangeable formats. This principle ensures:

  • Clear entity references: Distinguish between evaluator references and definitions

  • Objective portability: Move evaluation criteria between systems

  • Implementation flexibility: Express objectives independent of specific implementations

  • Semantic preservation: Maintain meaning across different contexts

The distinction between referencing an entity and describing it enables robust system integration.
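
The difference is easiest to see in serialized form. The first object below merely references an evaluator by identifier, which is meaningful only inside one system; the second describes it fully, so another system could reconstruct it. Field names are illustrative, not a Root Signals wire format:

```python
# A reference: points at an entity; meaningful only within one platform.
evaluator_reference = {"evaluator_id": "eval-1234"}

# A definition: describes the entity so it can be ported between systems.
evaluator_definition = {
    "objective": {
        "intent": "Measure whether answers stay faithful to the given context",
        "calibrator": [
            {"text": "...", "expected_score": 0.9},
        ],
    },
    # The implementation part is optional: an objective travels without it.
    "implementation": {
        "prompt": "Rate the faithfulness of the answer ...",
        "model": "gpt-4o",
    },
}
```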

6. Dimensional Decomposition

Complex evaluation predicates can be expressed either as a single (inherently composite) evaluator or decomposed into a vector of independent evaluators, each measuring a distinct dimension. This principle provides:

  • Granular calibration: Each dimension can be independently calibrated

  • Modular development: Evaluators can be developed and tested separately

  • Precise diagnostics: Identify which specific dimensions need improvement

  • Flexible composition: Combine dimensions based on use case requirements

For example, "helpfulness" might decompose into truthfulness, relevance, completeness, and clarity, each with its own evaluator and calibration set. This decomposition extends to specialized domains: RAG evaluators (faithfulness, context recall), structured output evaluators (JSON accuracy, property completeness), and task-specific evaluators (summarization quality, translation accuracy). Judges represent practical implementations of this principle, stacking multiple evaluators to achieve comprehensive assessment.
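
A sketch of the helpfulness example: each dimension is scored by its own evaluator, and the weakest dimension falls out as the diagnostic. The dimension names follow the text above; the evaluator callables are hypothetical:

```python
def evaluate_helpfulness(text, dimension_evaluators):
    """Score each dimension independently and report the weakest one.

    dimension_evaluators: hypothetical mapping of dimension name (e.g.
    "truthfulness", "relevance", "completeness", "clarity") to a
    calibrated text -> score callable returning values in [0, 1].
    """
    scores = {name: fn(text) for name, fn in dimension_evaluators.items()}
    weakest = min(scores, key=scores.get)  # the dimension to improve first
    return scores, weakest
```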

7. Operational Objectives

Similar to evaluation objectives, an operational task should have an objective that defines its success criteria independent of implementation. An operational objective consists of:

  • Intent: The business purpose of the operation

  • Success criteria: The set of evaluators that together define acceptable outcomes and what good looks like

  • Implementation independence: Multiple valid ways to achieve the objective

In practice, the success criteria can be captured in a judge, while the intent is captured in the judge's intent description.

This principle extends the objective/implementation separation to operational workflows, enabling outcome-based task definition rather than prescriptive implementation.
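
As an illustration, an operational objective can be sketched as an intent plus success criteria expressed as evaluator thresholds; in judge terms, the evaluator set and the intent description would carry the same information. Names and thresholds below are assumptions:

```python
from dataclasses import dataclass

@dataclass
class OperationalObjective:
    intent: str                         # business purpose of the operation
    success_criteria: dict[str, float]  # evaluator name -> minimum score

    def is_met(self, scores: dict[str, float]) -> bool:
        """True when every success-criterion evaluator clears its threshold."""
        return all(scores.get(name, 0.0) >= minimum
                   for name, minimum in self.success_criteria.items())

# Any implementation clearing these thresholds achieves the objective.
objective = OperationalObjective(
    intent="Answer customer billing questions accurately and politely",
    success_criteria={"truthfulness": 0.8, "politeness": 0.7},
)
print(objective.is_met({"truthfulness": 0.85, "politeness": 0.9}))  # True
```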

8. Orthogonality of the Root Evaluator Stack

The Root Evaluators are designed as a set of primitive, orthogonal measurement dimensions that minimize overlap while maximizing coverage. This principle ensures:

  • Minimal redundancy: Each evaluator measures a distinct semantic dimension

  • Maximal composability: Evaluators combine cleanly without interference

  • Complete coverage: The primitive set spans the space of common evaluation needs

  • Predictable composition: Combining evaluators yields intuitive results

This orthogonality enables judges to be constructed as precise combinations of primitive evaluators. For instance, "professional communication quality" might combine:

  • Clarity (information structure)

  • Formality (tone appropriateness)

  • Precision (technical accuracy)

  • Grammar correctness (linguistic quality)

Each dimension contributes independently, allowing fine-grained control over the composite evaluation. The orthogonal design prevents double-counting of features and ensures that improving one dimension does not inadvertently degrade another. Where an evaluator admits several plausible interpretations, these are split into separate objectives with corresponding Root Evaluators. Relevance is one such case: it may or may not be interpreted to include truthfulness. In a factual context an untrue statement is arguably irrelevant, whereas in a story or hypothetical context it may not be.
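
One way to sanity-check orthogonality empirically (an illustrative technique, not a documented platform feature) is to look for strongly correlated evaluator pairs on a shared sample set: a high correlation suggests overlapping features and a candidate for splitting into separate objectives:

```python
from itertools import combinations
from statistics import correlation  # Python 3.10+

def overlapping_pairs(score_table: dict[str, list[float]], threshold: float = 0.9):
    """Flag evaluator pairs whose scores correlate strongly across samples.

    score_table: evaluator name -> that evaluator's scores on a shared
    sample set; threshold is an illustrative cutoff for suspected overlap.
    """
    flagged = []
    for a, b in combinations(score_table, 2):
        if abs(correlation(score_table[a], score_table[b])) >= threshold:
            flagged.append((a, b))
    return flagged
```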

Practical Implications

These principles manifest throughout the Root Signals platform:

  • Evaluator creation starts with objective definition before implementation

  • Calibration workflows ensure measurement reliability

  • Judge composition allows stacking evaluators for complex assessments

  • Version control tracks both objectives and implementations

  • API design separates concerns between what and how

By adhering to these principles, Root Signals provides a semantically rigorous foundation for AI evaluation that scales from simple metrics to complex operational workflows.
