Comprehensively Testing Your LLM Code
Overview
Root Signals provides a multi-dimensional testing framework that ensures your LLM applications perform reliably across response quality, security, performance, and messaging alignment. This systematic approach helps you identify and prevent issues before they impact production.
Testing Dimensions
1. Response Quality
Correctness and Accuracy
Factual accuracy validation
Context relevance assessment
Coherence and consistency checks
Completeness verification
Implementation:
from root import RootSignals
client = RootSignals(api_key="your-api-key")
# Test response quality with multiple evaluators
relevance_result = client.evaluators.Relevance(
    request="What is the capital of France?",
    response="The capital of France is Paris, which is located in the north-central part of the country."
)

coherence_result = client.evaluators.Coherence(
    request="Explain machine learning",
    response="Machine learning is a subset of artificial intelligence that enables computers to learn and improve from experience without being explicitly programmed."
)

completeness_result = client.evaluators.Completeness(
    request="List the benefits of renewable energy",
    response="Renewable energy reduces carbon emissions, lowers long-term costs, and provides energy independence."
)
2. Security & Privacy
Content Safety
Harmlessness validation
Toxicity detection
Child safety assessment
Implementation:
# Security-focused evaluators
safety_result = client.evaluators.Harmlessness(
    request="How do I protect my personal information online?",
    response="To protect your personal information online, use strong passwords, enable two-factor authentication, and be cautious about sharing sensitive data."
)

toxicity_result = client.evaluators.Non_toxicity(
    request="What do you think about this situation?",
    response="I understand your frustration, and I'd be happy to help you find a solution."
)

child_safety_result = client.evaluators.Safety_for_Children(
    request="Tell me about animals",
    response="Animals are fascinating creatures that live in many different environments around the world."
)
3. Performance & Effectiveness
Response Quality Metrics
Helpfulness assessment
Clarity evaluation
Precision measurement
Implementation:
# Performance-focused evaluators
helpfulness_result = client.evaluators.Helpfulness(
    request="I need help setting up my email",
    response="I'd be happy to help you set up your email. First, let's identify which email provider you're using..."
)

clarity_result = client.evaluators.Clarity(
    request="Explain quantum computing",
    response="Quantum computing uses quantum bits (qubits) that can exist in multiple states simultaneously, enabling parallel processing of information."
)

precision_result = client.evaluators.Precision(
    request="What is the population of Tokyo?",
    response="The population of Tokyo is approximately 14 million people in the metropolitan area."
)
4. Messaging Alignment
Communication Style
Tone and formality validation
Politeness assessment
Persuasiveness measurement
Implementation:
# Messaging alignment evaluators
politeness_result = client.evaluators.Politeness(
    request="I want to return this product",
    response="I'd be happy to help you with your return. Let me walk you through the process."
)

formality_result = client.evaluators.Formality(
    request="Please provide the quarterly report",
    response="The quarterly report has been prepared and is attached for your review."
)

persuasiveness_result = client.evaluators.Persuasiveness(
    request="Why should I choose your service?",
    response="Our service offers 24/7 support, competitive pricing, and a proven track record of customer satisfaction."
)
Testing Approaches
Single Evaluator Testing
Basic Evaluation
# Test a single response with one evaluator
result = client.evaluators.Truthfulness(
    request="What was the revenue in Q1 2023?",
    response="The revenue in Q1 2023 was 5.2 million USD.",
    contexts=[
        "Financial statement of 2023: Q1 revenue was 5.2M USD",
        "2023 revenue and expenses report"
    ]
)

print(f"Score: {result.score}")
print(f"Justification: {result.justification}")
Multi-Evaluator Testing with Judges
Judge-Based Evaluation
# Use judges to run multiple evaluators together
judge_result = client.judges.run(
    judge_id="your-judge-id",
    request="What are the benefits of our product?",
    response="Our product offers excellent value, superior quality, and outstanding customer support."
)

# Process the individual evaluator results returned by the judge
for eval_result in judge_result.evaluator_results:
    print(f"{eval_result.evaluator_name}: {eval_result.score}")
    print(f"Justification: {eval_result.justification}")
RAG-Specific Testing
Context-Aware Evaluation
# Test RAG responses with context
rag_result = client.evaluators.Faithfulness(
    request="What is our return policy?",
    response="Customers can return items within 30 days of purchase for a full refund.",
    contexts=[
        "Company return policy: 30-day return window",
        "Customer service guidelines: Full refunds within 30 days"
    ]
)

context_precision = client.evaluators.Context_Precision(
    request="What is our return policy?",
    response="Items can be returned within 30 days for a full refund.",
    contexts=[
        "Return policy: 30-day return window with full refund",
        "Shipping policy: 3-5 business days delivery"
    ],
    expected_output="Items can be returned within 30 days for a full refund."
)
Ground Truth Testing
Expected Output Validation
# Test against expected answers
correctness_result = client.evaluators.Answer_Correctness(
    request="What is 2 + 2?",
    response="2 + 2 equals 4",
    expected_output="4"
)

similarity_result = client.evaluators.Answer_Semantic_Similarity(
    request="Explain photosynthesis",
    response="Photosynthesis is the process where plants convert sunlight into energy",
    expected_output="Plants use sunlight to create energy through photosynthesis"
)
Testing Methodologies
Batch Testing Function
def batch_evaluate_responses(test_cases, evaluators):
    """
    Evaluate multiple test cases with multiple evaluators.
    """
    results = []
    for test_case in test_cases:
        case_results = {}
        for evaluator_name in evaluators:
            try:
                # Get the evaluator method by name
                evaluator_method = getattr(client.evaluators, evaluator_name)
                # Run the evaluation
                result = evaluator_method(
                    request=test_case["request"],
                    response=test_case["response"],
                    contexts=test_case.get("contexts", [])
                )
                case_results[evaluator_name] = {
                    "score": result.score,
                    "justification": result.justification
                }
            except Exception as e:
                case_results[evaluator_name] = {
                    "error": str(e),
                    "score": None
                }
        results.append({
            "test_case": test_case,
            "results": case_results
        })
    return results

# Example usage
test_cases = [
    {
        "request": "What is machine learning?",
        "response": "Machine learning is a type of AI that learns from data",
        "contexts": ["AI textbook chapter on machine learning"]
    },
    {
        "request": "How do I reset my password?",
        "response": "Click the 'Forgot Password' link on the login page",
        "contexts": ["User manual: password reset instructions"]
    }
]

evaluators = ["Relevance", "Clarity", "Helpfulness", "Truthfulness"]
batch_results = batch_evaluate_responses(test_cases, evaluators)
Regression Testing
def regression_test(baseline_results, current_results, threshold=0.05):
    """
    Compare current results against a baseline to detect regressions.
    """
    regressions = []
    for evaluator in baseline_results:
        baseline_score = baseline_results[evaluator]["score"]
        current_score = current_results[evaluator]["score"]
        if current_score < baseline_score - threshold:
            regressions.append({
                "evaluator": evaluator,
                "baseline_score": baseline_score,
                "current_score": current_score,
                "regression": baseline_score - current_score
            })
    return regressions

# Example usage
baseline = {
    "Relevance": {"score": 0.85},
    "Clarity": {"score": 0.78},
    "Helpfulness": {"score": 0.82}
}

current = {
    "Relevance": {"score": 0.83},
    "Clarity": {"score": 0.75},
    "Helpfulness": {"score": 0.84}
}

regressions = regression_test(baseline, current)
if regressions:
    print("Regressions detected:")
    for regression in regressions:
        print(f"  {regression['evaluator']}: {regression['regression']:.3f} drop")
Skills-Based Testing
Creating Test Skills
# Create a skill for testing
test_skill = client.skills.create(
    name="Customer Service Bot",
    intent="Provide helpful customer service responses",
    prompt="You are a helpful customer service agent. Answer the customer's question: {{question}}",
    model="gpt-4o",
    validators=[
        {"evaluator_name": "Politeness", "threshold": 0.8},
        {"evaluator_name": "Helpfulness", "threshold": 0.7},
        {"evaluator_name": "Clarity", "threshold": 0.75}
    ]
)

print(f"Created skill: {test_skill.id}")
Best Practices
Test Planning
Define Clear Objectives: Identify what aspects of your LLM application need testing
Select Appropriate Evaluators: Choose evaluators that match your testing goals
Prepare Representative Data: Use realistic test cases that reflect actual usage
Set Meaningful Thresholds: Establish score thresholds that align with quality requirements
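A test plan can be expressed directly in code as evaluator/threshold pairs plus representative cases. The following is a minimal sketch that reuses the client and evaluator-call pattern shown earlier; the threshold values and the test case are illustrative placeholders, not recommendations:
# Hedged sketch of a test plan: thresholds and cases are placeholders.
quality_thresholds = {
    "Relevance": 0.75,
    "Clarity": 0.70,
}
representative_cases = [
    {
        "request": "How do I update my billing address?",
        "response": "Go to Account Settings, open the Billing tab, and edit the address field."
    }
]
for case in representative_cases:
    for evaluator_name, threshold in quality_thresholds.items():
        result = getattr(client.evaluators, evaluator_name)(
            request=case["request"],
            response=case["response"]
        )
        status = "PASS" if result.score >= threshold else "FAIL"
        print(f"{evaluator_name}: {result.score:.2f} ({status})")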
Evaluation Design
Use Multiple Evaluators: Combine different evaluators for comprehensive assessment
Include Context When Relevant: Provide context for RAG evaluators
Test Edge Cases: Include challenging scenarios in your test suite
Document Justifications: Review evaluator justifications to understand score reasoning
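As a concrete example of combining evaluators and reviewing justifications, the sketch below runs several evaluators against a deliberately awkward edge-case input and prints the justification for any low score. The chosen evaluators and the 0.6 review threshold are illustrative assumptions:
# Hedged sketch: surface justifications for low-scoring edge cases.
edge_case = {
    "request": "asdf ??? help",  # deliberately malformed input
    "response": "I'm not sure I understood that. Could you rephrase your question?"
}
for name in ["Relevance", "Clarity", "Helpfulness"]:
    result = getattr(client.evaluators, name)(
        request=edge_case["request"],
        response=edge_case["response"]
    )
    if result.score < 0.6:  # review threshold is an arbitrary placeholder
        print(f"{name} scored {result.score:.2f}: {result.justification}")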
Continuous Improvement
Regular Testing: Run evaluations consistently during development
Track Score Trends: Monitor evaluation scores over time
Calibrate Thresholds: Adjust score thresholds based on real-world performance
Update Test Cases: Expand test coverage as your application evolves
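One lightweight way to track score trends is to append each run's scores to a local history file and compare them over time. The file name and record layout below are assumptions for illustration, not part of the Root Signals SDK:
import json
import time

# Hedged sketch: persist scores per run so trends can be diffed or plotted later.
def record_scores(scores, path="evaluation_history.jsonl"):
    record = {"timestamp": time.time(), "scores": scores}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

record_scores({"Relevance": 0.83, "Clarity": 0.75, "Helpfulness": 0.84})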
Integration Examples
CI/CD Pipeline Testing
#!/usr/bin/env python3
"""
CI/CD evaluation script
"""
import sys

from root import RootSignals


def main():
    client = RootSignals(api_key="your-api-key")

    # Define minimum acceptable scores
    thresholds = {
        "Relevance": 0.7,
        "Clarity": 0.65,
        "Helpfulness": 0.7,
        "Safety_for_Children": 0.9
    }

    # Test cases
    test_cases = [
        {
            "request": "How do I contact support?",
            "response": "You can contact support by calling 1-800-HELP or emailing [email protected]"
        },
        {
            "request": "What are your hours?",
            "response": "We're open Monday through Friday from 9 AM to 6 PM EST"
        }
    ]

    failures = []
    for i, test_case in enumerate(test_cases):
        print(f"Testing case {i + 1}...")
        for evaluator_name, threshold in thresholds.items():
            evaluator_method = getattr(client.evaluators, evaluator_name)
            result = evaluator_method(
                request=test_case["request"],
                response=test_case["response"]
            )
            if result.score < threshold:
                failures.append({
                    "case": i + 1,
                    "evaluator": evaluator_name,
                    "score": result.score,
                    "threshold": threshold,
                    "justification": result.justification
                })

    if failures:
        print("❌ Evaluation failures detected:")
        for failure in failures:
            print(f"  Case {failure['case']}: {failure['evaluator']} scored {failure['score']:.3f} (threshold: {failure['threshold']})")
        sys.exit(1)
    else:
        print("✅ All evaluations passed!")


if __name__ == "__main__":
    main()
Troubleshooting
Common Issues
1. Multiple Evaluators with the Same Name
If you encounter errors like "Multiple evaluators found with name 'X'", use evaluator IDs instead:
# Get the evaluator by ID to avoid naming conflicts
evaluators = list(client.evaluators.list())
evaluator_id = next(e.id for e in evaluators if e.name == "Desired Evaluator Name")

result = client.evaluators.run(
    evaluator_id=evaluator_id,
    request="Your request",
    response="Your response"
)
2. Missing Required Parameters
Some evaluators require specific parameters:
Ground Truth Evaluators: require the expected_output parameter
RAG Evaluators: require the contexts parameter as a list of strings
Function Call Evaluators: require the functions parameter
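If you drive evaluators by name, as in batch_evaluate_responses above, a small pre-flight check can catch missing parameters before the API call. The REQUIRED_FIELDS mapping below is a local convention for illustration, not part of the SDK:
# Hedged sketch: map evaluator names to the extra fields they need.
REQUIRED_FIELDS = {
    "Answer_Correctness": ["expected_output"],
    "Faithfulness": ["contexts"],
}

def validate_test_case(evaluator_name, test_case):
    missing = [field for field in REQUIRED_FIELDS.get(evaluator_name, [])
               if field not in test_case]
    if missing:
        raise ValueError(f"{evaluator_name} requires: {', '.join(missing)}")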
3. Evaluator Naming Conventions
Use direct property access: client.evaluators.Relevance()
For multi-word evaluators, use underscores: client.evaluators.Answer_Correctness()
Alternative: use client.evaluators.run_by_name("evaluator_name") for dynamic names
Best Practices for Robust Testing
Handle Exceptions: Always wrap evaluator calls in try/except blocks
Validate Parameters: Check required parameters before making calls
Use Consistent Naming: Follow the underscore convention for multi-word evaluators
Monitor API Limits: Be aware of rate limits when running batch evaluations
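Putting these points together, a defensive wrapper around evaluator calls might look like the sketch below. It assumes a simple exponential backoff is an acceptable way to handle transient failures and rate limits; tune the retry count and delays for your account:
import time

# Hedged sketch: catch exceptions and back off between retries.
def safe_evaluate(evaluator_name, max_retries=3, **kwargs):
    evaluator_method = getattr(client.evaluators, evaluator_name)
    for attempt in range(max_retries):
        try:
            return evaluator_method(**kwargs)
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # wait 1s, then 2s, before the final attempt

result = safe_evaluate(
    "Relevance",
    request="What is the capital of France?",
    response="The capital of France is Paris."
)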
This testing framework helps ensure your LLM applications meet quality, safety, and performance standards using Root Signals' evaluator library and the testing approaches described above.