Roadmap
Root Signals builds with a philosophy of transparency, backed by multiple open source projects. This roadmap is a living document describing what we're working on and what's coming next. Root Signals is the world's most principled and powerful system for measuring the behaviour of LLM-based applications, agents and workflows.
Scorable is our automated LLM Evaluation Engineer agent that co-manages this platform with you.
Vision
Our vision is to create and auto-optimize the strongest possible automated evaluation stack for knowledge processes, with the least amount of effort and information required from the user.
Maximum Automated Information Extraction
From user intent and/or provided example/instruction data, extract as much relevant information as possible.
Awareness of Information Quality
Engage the user with the smallest number of maximally impactful questions.
Maximally Powerful Evaluation Stack Generation
Build the most comprehensive and accurate evaluation capabilities possible, within the confines of the available data.
Built for Agents
Maximum compatibility with autonomous agents and workflows.
Maximum Integration Surface
Seamless integration with all key AI frameworks.
EvalOps Principles for the Long Term
Follow Root EvalOps Principles for evaluator lifecycle management.
🚀 Recently Released
✅ Automated Policy Adherence Judges
Create judges from uploaded policy documents and intents
✅ GDPR awareness of models (link)
Ability to filter out models not complying with GDPR
✅ Evaluator Calibration Data Synthesizer v1.0 (link)
In the evaluator drill-in view, expand your calibration dataset from 1 or more examples
✅ Evaluator version history and controls extended to include all native Root Evaluators (link)
✅ Evaluator determinism benchmarks and standard deviations in reference datasets (link)
✅ Agent Evaluation MCP: stdio & SSE versions (link); see the connection sketch after this list
✅ Root Judge, our 70B judge LLM, available to download and run in Root Signals for free!
✅ Public Evaluation Reports
Generate HTML reports from any judge execution
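For context on the MCP release above: the stdio variant lets any MCP-capable agent framework discover Root Signals evaluation tools as a local subprocess. The sketch below uses the public MCP Python SDK to connect and list those tools; the launcher command, package name, and environment variable are placeholders we chose for illustration rather than documented values, so follow the linked docs for the real ones.

```python
# A minimal connection sketch, assuming the public MCP Python SDK
# (pip install mcp). The launcher command, package name, and environment
# variable below are hypothetical placeholders, not documented values.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

server_params = StdioServerParameters(
    command="uvx",                        # hypothetical launcher
    args=["root-signals-mcp"],            # hypothetical package name
    env={"ROOT_SIGNALS_API_KEY": "..."},  # hypothetical variable name
)

async def main() -> None:
    # Spawn the stdio server and open an MCP client session over its pipes.
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Discover the evaluation tools the server exposes to the agent.
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

asyncio.run(main())
```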
🏗️ Next Up
TypeScript SDK
Rework of Example-driven Evaluation
Smoothly create a full judge from examples
Native Speech Evaluator API
Upload or stream audio directly to evaluators
Unified Experiments Framework to Replace Skill Tests
Command Line Interface
Advanced Judge visibility controls
RBAC coverage on Judges (as already available for Evaluators, Skills, and Datasets)
Output Refinement At-Origin
Refine your LLM outputs automatically based on scores
🗓️ Planned
Scorable Features
Agentic Classifier Generation 2.0
Create classifiers with the same robustness as metric evaluator stacks
Automatic Context Engineering
Refine your prompt templates automatically based on scores
Support all RAG evaluators
Core Platform Features
Improved Playground View
Root Evaluators
Agent Evaluation Pack 2.0
(Root Evaluator list expanding every 1-2 weeks, stay tuned)
Integrations
Full OpenTelemetry Support (see the tracing sketch after this list)
LiteLLM Direct Support
OpenRouter Support
(more coming)
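As an illustration of what full OpenTelemetry support would enable, the sketch below emits an LLM span using the standard OpenTelemetry Python SDK toward an OTLP endpoint. The endpoint URL and span attribute names are placeholders we chose for illustration; the actual integration details will only be published when the feature ships.

```python
# A minimal sketch using the standard OpenTelemetry Python SDK
# (pip install opentelemetry-sdk opentelemetry-exporter-otlp-proto-http).
# The endpoint URL and attribute names are placeholders for illustration;
# the actual integration details are not yet published.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Export spans to an OTLP/HTTP collector (placeholder endpoint).
exporter = OTLPSpanExporter(endpoint="https://collector.example.com/v1/traces")
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("my-llm-app")

# Record one LLM call as a span; an evaluation backend could pick up
# the prompt/response attributes from the exported trace.
with tracer.start_as_current_span("llm.generation") as span:
    span.set_attribute("llm.prompt", "Summarize our refund policy.")
    span.set_attribute("llm.response", "Refunds are available within 30 days.")
```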
Developer Tools
Sync Judge & Evaluator Definitions to GitHub
Community & Deployment
Community Evals
Self-Hostable Evaluation Executor
MCP
Remote MCP Server
MCP Feature Extension Pack
Full judge feature access
Full log insights access
Model Support
Reasoner-specific model parameters (incl. budget) in evaluators
(the supported model list is continuously expanding, stay tuned)
More planned features are coming as we sync our changelogs and the rest of our internal roadmap!
Feature Requests and Bug Reports:
🐛 Bug Reports: GitHub Issues
📧 Enterprise Features: Contact [email protected]
💡 General: Discord
Last updated: 2025-06-30