Why Anything?

Why is Evals suddenly a thing now?

Evaluation has been critical in Machine Learning/AI systems for decades, but it did not pose a real challenge until generative AI. Back then, we had test sets with ground-truth annotations and simply calculated proportions/ratios (e.g. accuracy, precision, recall, AUROC, F1-score) or well-defined metric formulas (e.g. mean absolute error, or some custom metric) to estimate the performance of our systems. If we were satisfied with the accuracy and latency, we deployed our AI models to production.
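As a reminder of how simple that workflow was, here is a minimal sketch of classic test-set evaluation, assuming a binary classification task and scikit-learn; the label arrays are hypothetical placeholders, not part of Root Signals.

```python
# Classic ML evaluation: compare predictions against ground-truth labels
# from a held-out test set and report fixed metric formulas.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]  # hypothetical ground-truth annotations
y_pred = [1, 0, 0, 1, 0, 1]  # hypothetical model predictions over pre-defined categories

print(f"accuracy:  {accuracy_score(y_true, y_pred):.2f}")
print(f"precision: {precision_score(y_true, y_pred):.2f}")
print(f"recall:    {recall_score(y_true, y_pred):.2f}")
print(f"f1-score:  {f1_score(y_true, y_pred):.2f}")
```

With free-text, non-deterministic LLM output there is no fixed label space to score against in this way.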

We cannot do that anymore because LLMs

  1. output free text instead of pre-defined categories or numerical values

  2. are non-deterministic

  3. are instructable by semantic guidance; in other words, they have a prompt. Their behaviour depends on that prompt, which is difficult to predict beforehand.

Therefore, applications powered by LLMs are inherently unpredictable, unreliable, weird, and generally hard to control.

This is the main blocker of large scale adoption and value creation with Generative AI. To overcome this, we need a new way of measuring, monitoring, and guardrailing AI systems.

But there are LLM benchmarks, no?

Yes, there are numerous LLM benchmarks and leaderboards, yet

  • They measure LLMs, not LLM applications. Benchmarks focus on low-level academic metrics that are far removed from business goals.

  • Tasks and samples in those benchmarks do not reflect real-life use cases. For example, performance on multiple-choice high-school geometry questions is not relevant when you are developing a customer support chatbot that should not hallucinate.

  • Benchmarks are full of low-quality, incomplete, ambiguous, and erroneous samples.

  • Data leakage is rampant. Consciously or not, test samples or slight variations of them are often leaked into the training data.

  • Benchmarks are not always transparent about the settings used in their experiments (e.g. temperature, zero-shot vs. few-shot, prompts) and are hard to replicate.

In short,

You want to measure and monitor your specific LLM-powered automation, not the generic academic capabilities of an LLM.
