Comprehensively Testing Your LLM Code
Overview
Root Signals provides a multi-dimensional testing framework that ensures your LLM applications perform reliably across response quality, security, performance, and messaging alignment. This systematic approach helps you identify and prevent issues before they impact production.
Testing Dimensions
1. Response Quality
Correctness and Accuracy
Factual accuracy validation
Context relevance assessment
Coherence and consistency checks
Completeness verification
Implementation:
from root import RootSignals
client = RootSignals(api_key="your-api-key")
# Test response quality with multiple evaluators
relevance_result = client.evaluators.Relevance(
    request="What is the capital of France?",
    response="The capital of France is Paris, which is located in the north-central part of the country."
)

coherence_result = client.evaluators.Coherence(
    request="Explain machine learning",
    response="Machine learning is a subset of artificial intelligence that enables computers to learn and improve from experience without being explicitly programmed."
)

completeness_result = client.evaluators.Completeness(
    request="List the benefits of renewable energy",
    response="Renewable energy reduces carbon emissions, lowers long-term costs, and provides energy independence."
)
2. Security & Privacy
Content Safety
Harmlessness validation
Toxicity detection
Child safety assessment
Implementation:
# Security-focused evaluators
safety_result = client.evaluators.Harmlessness(
    request="How do I protect my personal information online?",
    response="To protect your personal information online, use strong passwords, enable two-factor authentication, and be cautious about sharing sensitive data."
)

toxicity_result = client.evaluators.Non_toxicity(
    request="What do you think about this situation?",
    response="I understand your frustration, and I'd be happy to help you find a solution."
)

child_safety_result = client.evaluators.Safety_for_Children(
    request="Tell me about animals",
    response="Animals are fascinating creatures that live in many different environments around the world."
)
3. Performance & Effectiveness
Response Quality Metrics
Helpfulness assessment
Clarity evaluation
Precision measurement
Implementation:
# Performance-focused evaluators
helpfulness_result = client.evaluators.Helpfulness(
    request="I need help setting up my email",
    response="I'd be happy to help you set up your email. First, let's identify which email provider you're using..."
)

clarity_result = client.evaluators.Clarity(
    request="Explain quantum computing",
    response="Quantum computing uses quantum bits (qubits) that can exist in multiple states simultaneously, enabling parallel processing of information."
)

precision_result = client.evaluators.Precision(
    request="What is the population of Tokyo?",
    response="The population of Tokyo is approximately 14 million people in the metropolitan area."
)
4. Messaging Alignment
Communication Style
Tone and formality validation
Politeness assessment
Persuasiveness measurement
Implementation:
# Messaging alignment evaluators
politeness_result = client.evaluators.Politeness(
    request="I want to return this product",
    response="I'd be happy to help you with your return. Let me walk you through the process."
)

formality_result = client.evaluators.Formality(
    request="Please provide the quarterly report",
    response="The quarterly report has been prepared and is attached for your review."
)

persuasiveness_result = client.evaluators.Persuasiveness(
    request="Why should I choose your service?",
    response="Our service offers 24/7 support, competitive pricing, and a proven track record of customer satisfaction."
)
Testing Approaches
Single Evaluator Testing
Basic Evaluation
# Test a single response with one evaluator
result = client.evaluators.Truthfulness(
    request="What was the revenue in Q1 2023?",
    response="The revenue in Q1 2023 was 5.2 million USD.",
    contexts=[
        "Financial statement of 2023: Q1 revenue was 5.2M USD",
        "2023 revenue and expenses report"
    ]
)

print(f"Score: {result.score}")
print(f"Justification: {result.justification}")
Multi-Evaluator Testing with Judges
Judge-Based Evaluation
# Use judges to run multiple evaluators together
judge_result = client.judges.run(
    judge_id="your-judge-id",
    request="What are the benefits of our product?",
    response="Our product offers excellent value, superior quality, and outstanding customer support."
)

# Process the individual evaluator results returned by the judge
for eval_result in judge_result.evaluator_results:
    print(f"{eval_result.evaluator_name}: {eval_result.score}")
    print(f"Justification: {eval_result.justification}")
RAG-Specific Testing
Context-Aware Evaluation
# Test RAG responses with context
rag_result = client.evaluators.Faithfulness(
    request="What is our return policy?",
    response="Customers can return items within 30 days of purchase for a full refund.",
    contexts=[
        "Company return policy: 30-day return window",
        "Customer service guidelines: Full refunds within 30 days"
    ]
)

context_precision = client.evaluators.Context_Precision(
    request="What is our return policy?",
    response="Items can be returned within 30 days for a full refund.",
    contexts=[
        "Return policy: 30-day return window with full refund",
        "Shipping policy: 3-5 business days delivery"
    ],
    expected_output="Items can be returned within 30 days for a full refund."
)
Ground Truth Testing
Expected Output Validation
# Test against expected answers
correctness_result = client.evaluators.Answer_Correctness(
    request="What is 2 + 2?",
    response="2 + 2 equals 4",
    expected_output="4"
)

similarity_result = client.evaluators.Answer_Semantic_Similarity(
    request="Explain photosynthesis",
    response="Photosynthesis is the process where plants convert sunlight into energy",
    expected_output="Plants use sunlight to create energy through photosynthesis"
)
Testing Methodologies
Batch Testing Function
def batch_evaluate_responses(test_cases, evaluators):
    """
    Evaluate multiple test cases with multiple evaluators.
    """
    results = []
    for test_case in test_cases:
        case_results = {}
        for evaluator_name in evaluators:
            try:
                # Get the evaluator method by name
                evaluator_method = getattr(client.evaluators, evaluator_name)
                # Run the evaluation
                result = evaluator_method(
                    request=test_case["request"],
                    response=test_case["response"],
                    contexts=test_case.get("contexts", [])
                )
                case_results[evaluator_name] = {
                    "score": result.score,
                    "justification": result.justification
                }
            except Exception as e:
                case_results[evaluator_name] = {
                    "error": str(e),
                    "score": None
                }
        results.append({
            "test_case": test_case,
            "results": case_results
        })
    return results

# Example usage
test_cases = [
    {
        "request": "What is machine learning?",
        "response": "Machine learning is a type of AI that learns from data",
        "contexts": ["AI textbook chapter on machine learning"]
    },
    {
        "request": "How do I reset my password?",
        "response": "Click the 'Forgot Password' link on the login page",
        "contexts": ["User manual: password reset instructions"]
    }
]

evaluators = ["Relevance", "Clarity", "Helpfulness", "Truthfulness"]
batch_results = batch_evaluate_responses(test_cases, evaluators)
Regression Testing
def regression_test(baseline_results, current_results, threshold=0.05):
    """
    Compare current results against a baseline to detect regressions.
    """
    regressions = []
    for evaluator in baseline_results:
        baseline_score = baseline_results[evaluator]["score"]
        current_score = current_results[evaluator]["score"]
        if current_score < baseline_score - threshold:
            regressions.append({
                "evaluator": evaluator,
                "baseline_score": baseline_score,
                "current_score": current_score,
                "regression": baseline_score - current_score
            })
    return regressions

# Example usage
baseline = {
    "Relevance": {"score": 0.85},
    "Clarity": {"score": 0.78},
    "Helpfulness": {"score": 0.82}
}

current = {
    "Relevance": {"score": 0.83},
    "Clarity": {"score": 0.75},
    "Helpfulness": {"score": 0.84}
}

regressions = regression_test(baseline, current)
if regressions:
    print("Regressions detected:")
    for regression in regressions:
        print(f"  {regression['evaluator']}: {regression['regression']:.3f} drop")
Skills-Based Testing
Creating Test Skills
# Create a skill for testing
test_skill = client.skills.create(
    name="Customer Service Bot",
    intent="Provide helpful customer service responses",
    prompt="You are a helpful customer service agent. Answer the customer's question: {{question}}",
    model="gpt-4o",
    validators=[
        {"evaluator_name": "Politeness", "threshold": 0.8},
        {"evaluator_name": "Helpfulness", "threshold": 0.7},
        {"evaluator_name": "Clarity", "threshold": 0.75}
    ]
)

print(f"Created skill: {test_skill.id}")
Best Practices
Test Planning
Define Clear Objectives: Identify what aspects of your LLM application need testing
Select Appropriate Evaluators: Choose evaluators that match your testing goals
Prepare Representative Data: Use realistic test cases that reflect actual usage
Set Meaningful Thresholds: Establish score thresholds that align with quality requirements
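A test plan can be expressed directly in code as evaluator/threshold pairs plus representative cases. The following is a minimal sketch that reuses the client and evaluator-call pattern shown earlier; the threshold values and the test case are illustrative placeholders, not recommendations:
# Hedged sketch of a test plan: thresholds and cases are placeholders.
quality_thresholds = {
    "Relevance": 0.75,
    "Clarity": 0.70,
}
representative_cases = [
    {
        "request": "How do I update my billing address?",
        "response": "Go to Account Settings, open the Billing tab, and edit the address field."
    }
]
for case in representative_cases:
    for evaluator_name, threshold in quality_thresholds.items():
        result = getattr(client.evaluators, evaluator_name)(
            request=case["request"],
            response=case["response"]
        )
        status = "PASS" if result.score >= threshold else "FAIL"
        print(f"{evaluator_name}: {result.score:.2f} ({status})")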
Evaluation Design
Use Multiple Evaluators: Combine different evaluators for comprehensive assessment
Include Context When Relevant: Provide context for RAG evaluators
Test Edge Cases: Include challenging scenarios in your test suite
Document Justifications: Review evaluator justifications to understand score reasoning
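As a concrete example of combining evaluators and reviewing justifications, the sketch below runs several evaluators against a deliberately awkward edge-case input and prints the justification for any low score. The chosen evaluators and the 0.6 review threshold are illustrative assumptions:
# Hedged sketch: surface justifications for low-scoring edge cases.
edge_case = {
    "request": "asdf ??? help",  # deliberately malformed input
    "response": "I'm not sure I understood that. Could you rephrase your question?"
}
for name in ["Relevance", "Clarity", "Helpfulness"]:
    result = getattr(client.evaluators, name)(
        request=edge_case["request"],
        response=edge_case["response"]
    )
    if result.score < 0.6:  # review threshold is an arbitrary placeholder
        print(f"{name} scored {result.score:.2f}: {result.justification}")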
Continuous Improvement
Regular Testing: Run evaluations consistently during development
Track Score Trends: Monitor evaluation scores over time
Calibrate Thresholds: Adjust score thresholds based on real-world performance
Update Test Cases: Expand test coverage as your application evolves
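One lightweight way to track score trends is to append each run's scores to a local history file and compare them over time. The file name and record layout below are assumptions for illustration, not part of the Root Signals SDK:
import json
import time

# Hedged sketch: persist scores per run so trends can be diffed or plotted later.
def record_scores(scores, path="evaluation_history.jsonl"):
    record = {"timestamp": time.time(), "scores": scores}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

record_scores({"Relevance": 0.83, "Clarity": 0.75, "Helpfulness": 0.84})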
Integration Examples
CI/CD Pipeline Testing
#!/usr/bin/env python3
"""
CI/CD evaluation script
"""
import sys

from root import RootSignals


def main():
    client = RootSignals(api_key="your-api-key")

    # Define minimum acceptable scores
    thresholds = {
        "Relevance": 0.7,
        "Clarity": 0.65,
        "Helpfulness": 0.7,
        "Safety_for_Children": 0.9
    }

    # Test cases
    test_cases = [
        {
            "request": "How do I contact support?",
            "response": "You can contact support by calling 1-800-HELP or emailing [email protected]"
        },
        {
            "request": "What are your hours?",
            "response": "We're open Monday through Friday from 9 AM to 6 PM EST"
        }
    ]

    failures = []
    for i, test_case in enumerate(test_cases):
        print(f"Testing case {i + 1}...")
        for evaluator_name, threshold in thresholds.items():
            evaluator_method = getattr(client.evaluators, evaluator_name)
            result = evaluator_method(
                request=test_case["request"],
                response=test_case["response"]
            )
            if result.score < threshold:
                failures.append({
                    "case": i + 1,
                    "evaluator": evaluator_name,
                    "score": result.score,
                    "threshold": threshold,
                    "justification": result.justification
                })

    if failures:
        print("❌ Evaluation failures detected:")
        for failure in failures:
            print(f"  Case {failure['case']}: {failure['evaluator']} scored {failure['score']:.3f} (threshold: {failure['threshold']})")
        sys.exit(1)
    else:
        print("✅ All evaluations passed!")


if __name__ == "__main__":
    main()
Troubleshooting
Common Issues
1. Multiple Evaluators with the Same Name
If you encounter errors like "Multiple evaluators found with name 'X'", use evaluator IDs instead:
# Get the evaluator by ID to avoid naming conflicts
evaluators = list(client.evaluators.list())
evaluator_id = next(e.id for e in evaluators if e.name == "Desired Evaluator Name")

result = client.evaluators.run(
    evaluator_id=evaluator_id,
    request="Your request",
    response="Your response"
)
2. Missing Required Parameters
Some evaluators require specific parameters:
Ground Truth Evaluators: require the expected_output parameter
RAG Evaluators: require the contexts parameter as a list of strings
Function Call Evaluators: require the functions parameter
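If you drive evaluators by name, as in batch_evaluate_responses above, a small pre-flight check can catch missing parameters before the API call. The REQUIRED_FIELDS mapping below is a local convention for illustration, not part of the SDK:
# Hedged sketch: map evaluator names to the extra fields they need.
REQUIRED_FIELDS = {
    "Answer_Correctness": ["expected_output"],
    "Faithfulness": ["contexts"],
}

def validate_test_case(evaluator_name, test_case):
    missing = [field for field in REQUIRED_FIELDS.get(evaluator_name, [])
               if field not in test_case]
    if missing:
        raise ValueError(f"{evaluator_name} requires: {', '.join(missing)}")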
3. Evaluator Naming Conventions
Use direct property access: client.evaluators.Relevance()
For multi-word evaluators, use underscores: client.evaluators.Answer_Correctness()
Alternative: use client.evaluators.run_by_name("evaluator_name") for dynamic names
Best Practices for Robust Testing
Handle Exceptions: Always wrap evaluator calls in try/except blocks
Validate Parameters: Check required parameters before making calls
Use Consistent Naming: Follow the underscore convention for multi-word evaluators
Monitor API Limits: Be aware of rate limits when running batch evaluations
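Putting these points together, a defensive wrapper around evaluator calls might look like the sketch below. It assumes a simple exponential backoff is an acceptable way to handle transient failures and rate limits; tune the retry count and delays for your account:
import time

# Hedged sketch: catch exceptions and back off between retries.
def safe_evaluate(evaluator_name, max_retries=3, **kwargs):
    evaluator_method = getattr(client.evaluators, evaluator_name)
    for attempt in range(max_retries):
        try:
            return evaluator_method(**kwargs)
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # wait 1s, then 2s, before the final attempt

result = safe_evaluate(
    "Relevance",
    request="What is the capital of France?",
    response="The capital of France is Paris."
)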
This testing framework helps ensure your LLM applications meet quality, safety, and performance standards using Root Signals' evaluator library and the testing approaches described above.