ai_eval
AI Eval measures and improves the quality of your AI integrations in Drupal. Define test datasets in YAML, run them against your agents or any AI provider, and get scored results with pass/fail quality gates.
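As an illustration only, a dataset file might look like the following. The field names (`id`, `questions`, `prompt`, `expected`) are assumptions for this sketch, not the module's documented schema — check the module's own examples for the real format:

```yaml
# Hypothetical dataset sketch; key names are illustrative, not the documented schema.
id: support_bot_smoke
questions:
  - prompt: "How do I clear the Drupal cache?"
    expected: "Mentions 'drush cr' or the admin Performance page."
  - prompt: "Which core module builds listing pages?"
    expected: "Names the Views module."
```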
Two evaluation modes
Agent mode invokes ai_agents plugins end-to-end, testing the full loop: tool calls, reasoning, and response quality.
Chat mode sends prompts directly to any AI provider (Anthropic, OpenAI, Ollama). No agent framework needed. Useful for evaluating system prompts, RAG pipelines, Q&A bots, or classification tasks.
Pluggable graders
Ships with seven graders. Four are LLM-based judges (relevance, completeness, accuracy, actionability) that score responses on a 1-5 scale. Three are deterministic (format validation, route matching, structured field matching) and run locally with no API cost. You can add your own graders as plugins in any module.
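To give a feel for what a custom deterministic grader could look like, here is a rough PHP sketch. The namespace, class placement, and `grade()` signature are assumptions, not the module's actual plugin API — consult the module's grader interface before writing a real plugin:

```php
<?php

// Hypothetical grader plugin sketch. The namespace, discovery mechanism,
// and method signature are assumptions, not the module's actual API.

namespace Drupal\my_module\Plugin\AiEvalGrader;

/**
 * Passes when the response contains a required keyword.
 */
class KeywordGrader {

  /**
   * Deterministic check: runs locally, no LLM call, no API cost.
   */
  public function grade(string $response, array $expected): float {
    $keyword = $expected['keyword'] ?? '';
    return ($keyword !== '' && str_contains($response, $keyword)) ? 1.0 : 0.0;
  }

}
```

Because a grader like this never calls a provider, it is cheap enough to run on every question in CI.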
Quality gates
Each target defines a threshold and a gate type. Hard gates fail the run if the score falls below the threshold; soft gates only log a warning. Use hard gates in CI to block deployments when quality drops.
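A target definition might look something like this. The key names (`mode`, `dataset`, `graders`, `threshold`, `gate`) are illustrative assumptions, not the module's config schema:

```yaml
# Hypothetical target config; key names are assumptions.
targets:
  support_bot:
    mode: chat
    dataset: support_questions
    graders: [relevance, accuracy]
    threshold: 4.0   # minimum average score on the 1-5 judge scale
    gate: hard       # below threshold: fail the run; 'soft' would only warn
```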
Prompt optimization
When a target fails its gate, the optimizer analyzes the failure patterns and proposes an improved system prompt. Proposals can auto-apply or go through an admin review workflow before taking effect.
Admin UI
- Configure targets, graders, and scoring thresholds
- Browse eval run history with per-question breakdowns
- Review and apply/reject optimization proposals
- Settings for judge provider, rate limiting, and dataset paths
Drush commands
- `drush ai-eval:run` runs all targets
- `drush ai-eval:optimize` optimizes failing prompts
- `drush ai-eval:distill` summarizes results as Markdown
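In a pipeline, a hard-gate failure should fail the job. A minimal GitLab CI sketch, assuming `drush ai-eval:run` exits non-zero when a hard gate fails (verify this against the module's actual exit-code behavior):

```yaml
# Sketch only; assumes a non-zero exit code on hard-gate failure.
ai-eval:
  stage: test
  script:
    - drush ai-eval:run
```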
Requirements
Drupal 11.2+, PHP 8.3+, AI module. Optional: AI Agents for agent mode.