Why Anything?

Why is Evals suddenly a thing now?

Evaluation has been critical in Machine Learning/AI systems for decades, but it did not pose a real challenge until generative AI. Back then, we had test sets with ground-truth annotations and simply calculated proportions/ratios (e.g. accuracy, precision, recall, AUROC, F1-score) or well-defined metric formulas (e.g. mean absolute error, or some custom metric) to estimate the performance of our systems. If we were satisfied with the accuracy and latency, we deployed our AI models to production.
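As a reminder of how simple that workflow was, here is a minimal sketch of classic test-set evaluation, assuming a binary classification task and scikit-learn; the label arrays are hypothetical placeholders, not part of Root Signals.

```python
# Classic ML evaluation: compare predictions against ground-truth labels
# from a held-out test set and report fixed metric formulas.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]  # hypothetical ground-truth annotations
y_pred = [1, 0, 0, 1, 0, 1]  # hypothetical model predictions over pre-defined categories

print(f"accuracy:  {accuracy_score(y_true, y_pred):.2f}")
print(f"precision: {precision_score(y_true, y_pred):.2f}")
print(f"recall:    {recall_score(y_true, y_pred):.2f}")
print(f"f1-score:  {f1_score(y_true, y_pred):.2f}")
```

With free-text, non-deterministic LLM output there is no fixed label space to score against in this way.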

We cannot do that anymore because LLMs

  1. output free text instead of pre-defined categories or numerical values

  2. are non-deterministic

  3. are instructable by semantic guidance; in other words, they have a prompt. Their behaviour depends on that prompt, which is difficult to predict beforehand.

Therefore, applications powered by LLMs are inherently unpredictable, unreliable, weird, and generally hard to control.

This is the main blocker of large scale adoption and value creation with Generative AI. To overcome this, we need a new way of measuring, monitoring, and guardrailing AI systems.

But there are LLM benchmarks, no?

Yes, there are numerous LLM benchmarks and leaderboards, yet

  • They measure LLMs, not LLM applications. Benchmarks focus on low-level academic metrics that are far removed from business goals.

  • Tasks and samples in those benchmarks do not reflect real-life use cases. For example, performance on multiple-choice high-school geometry questions is not relevant when you are developing a customer support chatbot that should not hallucinate.

  • Benchmarks are full of low-quality, incomplete, ambiguous, and erroneous samples.

  • Data leakage is rampant. Consciously or not, test samples or slight variations of them are often leaked into the training data.

  • Benchmarks are not always transparent about the settings used in their experiments (e.g. temperature, zero-shot vs. few-shot, prompts) and are hard to replicate.

In short,

You want to measure and monitor your specific LLM-powered automation, not the generic academic capabilities of an LLM.
