ai_eval
AI Eval measures and improves the quality of your AI integrations in Drupal. Define test datasets in YAML, run them against your agents or any AI provider, and get scored results with pass/fail quality gates.
Two evaluation modes
Agent mode invokes ai_agents plugins end-to-end, testing the full loop: tool calls, reasoning, and response quality.
Chat mode sends prompts directly to any AI provider (Anthropic, OpenAI, Ollama). No agent framework needed. Useful for evaluating system prompts, RAG pipelines, Q&A bots, or classification tasks.
Results dashboard
The results dashboard shows status cards with run summaries and week-over-week trends, plus a target health table with score meters, sparkline trends, and error counts.
Each result detail page shows collapsible per-question breakdowns with grader scores, tooltips with judge reasoning, and a weakest grader callout.
Prompt optimization
When a target fails its gate, the optimizer analyzes the failure patterns and proposes an improved system prompt. Review proposals with a side-by-side prompt comparison before applying.
LLM judges are noisy, so the optimizer runs each candidate prompt multiple times and compares the average against a baseline repeated the same number of times. You can tune the number of runs per candidate, require a minimum dataset size before the optimizer will run, and set an improvement margin that stays above the judge's own variance floor.
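The noise controls described above can be illustrated with a minimal Python sketch. It is not the module's implementation: the function names, the default values for runs, min_dataset_size, and margin, and the evaluate callback are all assumptions made for illustration.

```python
from statistics import mean
from typing import Callable


def mean_score(evaluate: Callable[[str], float], prompt: str, runs: int) -> float:
    """Score one prompt `runs` times and return the average score."""
    return mean(evaluate(prompt) for _ in range(runs))


def should_propose(
    evaluate: Callable[[str], float],
    baseline_prompt: str,
    candidate_prompt: str,
    dataset_size: int,
    runs: int = 3,               # hypothetical default
    min_dataset_size: int = 10,  # hypothetical default
    margin: float = 0.3,         # hypothetical default
) -> bool:
    """Return True when the candidate beats the baseline by at least `margin`."""
    # Refuse to optimize against a dataset that is too small to be meaningful.
    if dataset_size < min_dataset_size:
        return False
    # Baseline and candidate get the same number of runs, so both averages
    # carry a comparable amount of judge noise.
    baseline = mean_score(evaluate, baseline_prompt, runs)
    candidate = mean_score(evaluate, candidate_prompt, runs)
    return candidate - baseline >= margin
```

A margin set below the spread you observe between repeated runs of the same prompt would let noise alone trigger a proposal, which is why the margin is tuned together with the run count.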
Pluggable graders
Ships with seven graders. Five are LLM-based judges that score responses on a 0-5 scale:
- relevance_grader, completeness_grader, actionability_grader: rubric-based scoring across three response dimensions.
- accuracy_grader: rubric with explicit handling for must_not_contain matches (hard cap at 1) and evasive answers (score ≤ 2). Reads an optional expected_facts block from the dataset and weaves it into the judge prompt as ground truth.
- fact_match_grader: dedicated fact-recall judge that scores only against expected_facts. Skips cleanly when a row has none.
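The caps in the accuracy_grader rubric can be written out as plain logic. The Python sketch below is illustrative only. The text does not say whether the module enforces the caps in the judge prompt or in code, and the function name, the case-insensitive substring matching, and the evasive flag are assumptions.

```python
def apply_accuracy_caps(
    raw_score: int,
    response: str,
    must_not_contain: list[str],
    evasive: bool,
) -> int:
    """Apply the two rubric caps to a raw 0-5 judge score."""
    # Keep the score inside the documented 0-5 range.
    score = max(0, min(5, raw_score))
    lowered = response.lower()
    # A forbidden phrase caps the score at 1 (assumed: case-insensitive substring match).
    if any(term.lower() in lowered for term in must_not_contain):
        return min(score, 1)
    # An evasive answer can score at most 2 (assumed: the judge supplies this flag).
    if evasive:
        return min(score, 2)
    return score
```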
Two graders are deterministic and run locally with no API cost:
- format_grader: validates response length against a configurable character cap.
- tool_usage_grader: scores against an expected_tools list of {tool, should_run} pairs, naming missing or forbidden tools in the reason.
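As a rough illustration of how a deterministic check against expected_tools could work, here is a Python sketch. The scoring formula (share of expectations met, scaled to 0-5), the function name, and the tool names used with it are assumptions, not the module's actual behavior.

```python
def grade_tool_usage(
    expected_tools: list[dict],
    tools_called: set[str],
) -> tuple[float, str]:
    """Score tool calls against {tool, should_run} expectations."""
    # Tools that were expected to run but never did.
    missing = [
        e["tool"] for e in expected_tools
        if e["should_run"] and e["tool"] not in tools_called
    ]
    # Tools that were expected NOT to run but did.
    forbidden = [
        e["tool"] for e in expected_tools
        if not e["should_run"] and e["tool"] in tools_called
    ]
    total = len(expected_tools)
    if total == 0:
        return 5.0, "no tool expectations"
    # Assumed formula: fraction of expectations met, scaled to 0-5.
    met = total - len(missing) - len(forbidden)
    score = 5.0 * met / total
    parts = []
    if missing:
        parts.append("missing: " + ", ".join(missing))
    if forbidden:
        parts.append("forbidden: " + ", ".join(forbidden))
    return score, "; ".join(parts) or "all tool expectations met"
```

Because the check is a pure function of the recorded tool calls, it costs nothing to run and returns the same score every time, unlike the LLM-based judges.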
You can add your own graders as plugins in any module.
Judge validation
An LLM judge that disagrees with humans is an unvalidated ruler. If you gate deployments or optimize prompts based on judge scores, you need to know how much to trust those scores in the first place.
Two Drush commands close that loop:
- drush ai-eval:sample-traces pulls stored judge traces for one grader and writes a YAML fixture stubbed for human labeling.
- drush ai-eval:validate-judge compares your labels to the judge's scores and reports true-positive and true-negative rates with a pass/fail verdict against Shankar's trust rule (TPR ≥ 0.9 AND TNR ≥ 0.9).
Run this before you trust judge scores for gating or optimization decisions.
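The trust rule itself is simple arithmetic. The Python sketch below assumes that both the human labels and the judge's scores have already been reduced to pass/fail booleans. How the module does that reduction is not specified here, and the function name is hypothetical.

```python
def validate_judge(
    human: list[bool],
    judge: list[bool],
    min_rate: float = 0.9,
) -> dict:
    """Compare judge verdicts to human labels and apply the TPR/TNR rule."""
    if len(human) != len(judge):
        raise ValueError("human and judge label lists must have the same length")
    positives = sum(human)
    negatives = len(human) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one human pass and one human fail")
    # True positives: human says pass and the judge agrees.
    tp = sum(1 for h, j in zip(human, judge) if h and j)
    # True negatives: human says fail and the judge agrees.
    tn = sum(1 for h, j in zip(human, judge) if not h and not j)
    tpr = tp / positives
    tnr = tn / negatives
    return {"tpr": tpr, "tnr": tnr, "trusted": tpr >= min_rate and tnr >= min_rate}
```

Both rates have to clear the bar: a judge that passes everything has a perfect TPR and a TNR of zero, so checking overall agreement alone would hide that failure mode.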
Quality gates
Each target defines a threshold and a gate type. Hard gates fail the run when the score is below threshold, which is what you want in CI pipelines. Soft gates log a warning instead. Use hard gates to block deployments when AI quality drops.
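The gate decision can be sketched in a few lines of Python. The function name and the "pass" / "fail" / "warn" return values are hypothetical. The only behavior taken from the description above is that a hard gate fails the run when the score is below the threshold and a soft gate logs a warning instead.

```python
def gate_outcome(score: float, threshold: float, gate_type: str) -> str:
    """Return "pass", "fail", or "warn" for one target's run."""
    if gate_type not in ("hard", "soft"):
        raise ValueError(f"unknown gate type: {gate_type}")
    # A score at or above the threshold passes under either gate type.
    if score >= threshold:
        return "pass"
    # Below threshold: a hard gate fails the run, a soft gate only warns.
    return "fail" if gate_type == "hard" else "warn"
```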
Response char limit
Long agent responses can silently inflate judge prompt size and cost. A global response_char_limit caps the length of the response text the judge sees, with a per-target override for targets that need more headroom. The cap is explicit rather than hidden inside grader code.
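A minimal Python sketch of how a global cap with a per-target override might resolve. The function and parameter names are hypothetical, and truncating from the start of the response is an assumption. Only response_char_limit is named in the text above.

```python
from typing import Optional


def effective_limit(global_limit: int, target_override: Optional[int] = None) -> int:
    """Use the per-target override when one is set, else the global limit."""
    return target_override if target_override is not None else global_limit


def cap_response(
    response: str,
    global_limit: int,
    target_override: Optional[int] = None,
) -> str:
    """Trim the response text before it is placed into the judge prompt."""
    # Assumed behavior: keep the first N characters and drop the rest.
    return response[: effective_limit(global_limit, target_override)]
```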
Admin UI
- Configure targets with select fields for agents, providers, and models
- Browse evaluation history with score meters and pass/fail indicators
- Review and apply or reject optimization proposals with side-by-side prompt comparison
- Settings for judge provider, scoring thresholds, rate limiting, response char limit, optimizer noise controls, and dataset paths
Drush commands
- drush ai-eval:run (alias aer): run every enabled target, store scored results, decide pass/fail against each target's gate. Scope with --target, --question, impersonate with --as-user, tag cron runs with --source=cron, or preview with --dry-run.
- drush ai-eval:optimize (alias aeo): rewrite prompts on targets that failed their last gate. Runs baseline and candidate multiple times to average over judge noise. Use --propose to queue candidates for human review instead of auto-applying.
- drush ai-eval:distill (alias aed): summarize recent results as markdown intel files (weakest graders, recurring failure patterns, trend deltas). Good for team channels or planning docs.
- drush ai-eval:sample-traces (alias aest): pull stored judge traces for one grader into a YAML fixture stubbed for human labeling. Stratified by default, windowed by --window-days, sized by --n.
- drush ai-eval:validate-judge (alias aevj): compare human labels to judge scores and report TPR / TNR with a pass/fail verdict against Shankar's trust rule (TPR ≥ 0.9 AND TNR ≥ 0.9).
Requirements
Drupal 11.2+, PHP 8.3+, AI module. Optional: AI Agents for agent mode.
Documentation
Full documentation is in the git repository:
- README for installation, configuration, built-in graders, custom grader examples, and scoring
- Integration guide for provider setup, dataset writing, per-user access control testing, and cost estimation