ai_eval

AI Eval measures and improves the quality of your AI integrations in Drupal. Define test datasets in YAML, run them against your agents or any AI provider, and get scored results with pass/fail quality gates.

Two evaluation modes

Agent mode invokes ai_agents plugins end-to-end, testing the full loop: tool calls, reasoning, and response quality.

Chat mode sends prompts directly to any AI provider (Anthropic, OpenAI, Ollama). No agent framework needed. Useful for evaluating system prompts, RAG pipelines, Q&A bots, or classification tasks.

Results dashboard

The results dashboard shows status cards with run summaries and week-over-week trends, plus a target health table with score meters, sparkline trends, and error counts.

Each result detail page shows collapsible per-question breakdowns with grader scores, tooltips with judge reasoning, and a weakest grader callout.

Prompt optimization

When a target fails its gate, the optimizer analyzes the failure patterns and proposes an improved system prompt. Review proposals with a side-by-side prompt comparison before applying.

LLM judges are noisy, so the optimizer runs each candidate prompt multiple times and compares the average against an equally-repeated baseline. You can tune the number of runs per candidate, require a minimum dataset size before the optimizer will run, and set an improvement margin that stays above the judge's own variance floor.
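The comparison described above can be sketched as follows. This is a minimal illustration, not the module's implementation: the function name, the `margin` and `min_runs` parameters, and the example scores are all hypothetical, and it assumes "compares the average" means a plain arithmetic mean.

```python
from statistics import mean


def candidate_beats_baseline(baseline_scores, candidate_scores, margin, min_runs=3):
    """Accept a candidate prompt only if its mean score beats the baseline's
    mean by at least `margin` (hypothetical names, not the module's API)."""
    # The baseline must be repeated as often as the candidate, otherwise one
    # side's average is noisier than the other's.
    if len(baseline_scores) != len(candidate_scores):
        raise ValueError("baseline and candidate need the same number of runs")
    if len(candidate_scores) < min_runs:
        raise ValueError("too few runs to average over judge noise")
    return mean(candidate_scores) - mean(baseline_scores) >= margin


# Illustrative scores only: three runs each on the 0-5 judge scale.
baseline = [3.2, 3.6, 3.4]
candidate = [3.9, 4.1, 3.7]
print(candidate_beats_baseline(baseline, candidate, margin=0.3))
```

Setting the margin above the spread you see between repeated baseline runs keeps the optimizer from accepting a prompt that only looks better because of judge noise.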

Pluggable graders

The module ships with seven graders. Five are LLM-based judges that score responses on a 0-5 scale:

relevance_grader, completeness_grader, actionability_grader: rubric-based scoring across three response dimensions.
accuracy_grader: rubric with explicit handling for must_not_contain matches (hard cap at 1) and evasive answers (score ≤ 2). Reads an optional expected_facts block from the dataset and weaves it into the judge prompt as ground truth.
fact_match_grader: dedicated fact-recall judge that scores only against expected_facts. Skips cleanly when a row has none.

Two graders are deterministic and run locally with no API cost:

format_grader: validates response length against a configurable character cap.
tool_usage_grader: scores against an expected_tools list of {tool, should_run} pairs, naming missing or forbidden tools in the reason.
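To make the expected_tools check concrete, here is a rough Python sketch of a deterministic tool-usage score. Only the {tool, should_run} pair shape comes from the description above. The function name, the proportional scoring formula, the 0-5 range, the reason wording, and the example tool names are assumptions for illustration and may differ from what tool_usage_grader actually does.

```python
def score_tool_usage(expected_tools, tools_called, max_score=5.0):
    """Score a run against {tool, should_run} pairs (hypothetical scoring rule)."""
    if not expected_tools:
        return None, "no expected_tools on this row; skipped"
    called = set(tools_called)
    # A tool marked should_run=True that never ran is "missing".
    missing = [e["tool"] for e in expected_tools
               if e["should_run"] and e["tool"] not in called]
    # A tool marked should_run=False that did run is "forbidden".
    forbidden = [e["tool"] for e in expected_tools
                 if not e["should_run"] and e["tool"] in called]
    satisfied = len(expected_tools) - len(missing) - len(forbidden)
    score = max_score * satisfied / len(expected_tools)
    parts = []
    if missing:
        parts.append("missing: " + ", ".join(missing))
    if forbidden:
        parts.append("forbidden: " + ", ".join(forbidden))
    return score, "; ".join(parts) or "all tool expectations met"


# Hypothetical tool names, for illustration only.
expected = [
    {"tool": "search_content", "should_run": True},
    {"tool": "delete_node", "should_run": False},
]
print(score_tool_usage(expected, ["delete_node"]))
```

Because the check is a set comparison, it needs no API call and returns the same score for the same trace every time.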

You can add your own graders as plugins in any module.

Judge validation

An LLM judge that has not been checked against human labels is an unvalidated ruler. If you gate deployments or optimize prompts based on judge scores, you need to know how much to trust those scores in the first place.

Two Drush commands close that loop:

drush ai-eval:sample-traces pulls stored judge traces for one grader and writes a YAML fixture stubbed for human labeling.
drush ai-eval:validate-judge compares your labels to the judge's scores and reports true-positive and true-negative rates with a pass/fail verdict against Shankar's trust rule (TPR ≥ 0.9 AND TNR ≥ 0.9).

Run both commands before you trust judge scores for gating or optimization decisions.
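The trust rule itself is simple arithmetic. The sketch below assumes human labels are binary pass/fail and that a judge score counts as a pass at or above a `pass_score` cutoff. Both the cutoff and the function name are hypothetical, and how ai-eval:validate-judge binarizes scores is not stated here.

```python
def validate_judge(rows, pass_score=3.0, min_rate=0.9):
    """rows: iterable of (human_pass: bool, judge_score: float) pairs.
    Returns (tpr, tnr, trusted). `pass_score` is an assumed cutoff."""
    tp = fn = tn = fp = 0
    for human_pass, judge_score in rows:
        judge_pass = judge_score >= pass_score
        if human_pass and judge_pass:
            tp += 1
        elif human_pass:
            fn += 1
        elif judge_pass:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one human pass and one human fail label")
    tpr = tp / (tp + fn)  # share of human passes the judge also passed
    tnr = tn / (tn + fp)  # share of human fails the judge also failed
    return tpr, tnr, (tpr >= min_rate and tnr >= min_rate)


# Illustrative labels only.
labeled = [(True, 4), (True, 5), (True, 2), (False, 1), (False, 0)]
tpr, tnr, trusted = validate_judge(labeled)
print(f"TPR={tpr:.2f} TNR={tnr:.2f} trusted={trusted}")
```

Both rates must clear the bar: a judge that passes everything has a perfect TPR and a useless TNR, which is why the rule uses AND.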

Quality gates

Each target defines a threshold and a gate type. Hard gates fail the run when the score is below threshold, which is what you want in CI pipelines. Soft gates log a warning instead. Use hard gates to block deployments when AI quality drops.
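As a rough sketch of the hard/soft distinction: the function and field names below are hypothetical, and the non-zero exit code for a failed hard gate is an assumption about how a CI pipeline would detect the failure, not documented behavior.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    passed: bool
    exit_code: int  # assumed convention: non-zero makes a CI step fail
    message: str


def apply_gate(score, threshold, gate_type):
    """Evaluate one target's score against its gate (illustrative only)."""
    if gate_type not in ("hard", "soft"):
        raise ValueError(f"unknown gate type: {gate_type!r}")
    if score >= threshold:
        return GateResult(True, 0, f"pass: {score} >= {threshold}")
    if gate_type == "hard":
        # Hard gate: a score below threshold fails the run.
        return GateResult(False, 1, f"fail: {score} < {threshold}")
    # Soft gate: same comparison, but only a warning is logged.
    return GateResult(False, 0, f"warning: {score} < {threshold}")


print(apply_gate(3.1, 3.5, "hard"))
print(apply_gate(3.1, 3.5, "soft"))
```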

Response char limit

Long agent responses can silently inflate judge prompt size and cost. A global response_char_limit caps the length of the response text the judge sees, with a per-target override for targets that need more headroom. The cap is explicit rather than hidden inside grader code.
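The override logic amounts to "use the target's limit if it has one, otherwise the global one". A minimal sketch, with hypothetical function and parameter names, and assuming the cap is a plain truncation (Python slices by code point, which may not match how the module counts characters):

```python
from typing import Optional


def response_for_judge(response: str, global_limit: int,
                       target_limit: Optional[int] = None) -> str:
    """Return the response text the judge would see under the cap
    (illustrative only; names are not the module's API)."""
    # A per-target override wins over the global response_char_limit.
    limit = target_limit if target_limit is not None else global_limit
    if limit < 0:
        raise ValueError("limit must be non-negative")
    return response[:limit]


long_answer = "x" * 10_000
print(len(response_for_judge(long_answer, global_limit=4000)))
print(len(response_for_judge(long_answer, global_limit=4000, target_limit=8000)))
```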

Admin UI

Configure targets with select fields for agents, providers, and models
Browse evaluation history with score meters and pass/fail indicators
Review and apply or reject optimization proposals with side-by-side prompt comparison
Settings for judge provider, scoring thresholds, rate limiting, response char limit, optimizer noise controls, and dataset paths

Drush commands

drush ai-eval:run (alias aer): run every enabled target, store scored results, and decide pass/fail against each target's gate. Scope with --target or --question, impersonate a user with --as-user, tag cron runs with --source=cron, or preview with --dry-run.
drush ai-eval:optimize (alias aeo): rewrite prompts on targets that failed their last gate. Runs baseline and candidate multiple times to average over judge noise. Use --propose to queue candidates for human review instead of auto-applying.
drush ai-eval:distill (alias aed): summarize recent results as markdown intel files (weakest graders, recurring failure patterns, trend deltas). Good for team channels or planning docs.
drush ai-eval:sample-traces (alias aest): pull stored judge traces for one grader into a YAML fixture stubbed for human labeling. Stratified by default, windowed by --window-days, sized by --n.
drush ai-eval:validate-judge (alias aevj): compare human labels to judge scores and report TPR / TNR with a pass/fail verdict against Shankar's trust rule (TPR ≥ 0.9 AND TNR ≥ 0.9).

Requirements

Drupal 11.2+, PHP 8.3+, AI module. Optional: AI Agents for agent mode.

Documentation

Full documentation is in the git repository:

README for installation, configuration, built-in graders, custom grader examples, and scoring
Integration guide for provider setup, dataset writing, per-user access control testing, and cost estimation

Version	Type	Release date
1.0.0-alpha10	Pre-release	Apr 25, 2026
1.0.0-alpha8	Pre-release	Apr 23, 2026
1.0.0-alpha7	Pre-release	Apr 15, 2026
1.0.0-alpha6	Pre-release	Apr 9, 2026
1.0.0-alpha5	Pre-release	Apr 9, 2026
1.0.0-alpha4	Pre-release	Apr 8, 2026
1.0.0-alpha3	Pre-release	Apr 6, 2026
1.0.0-alpha2	Pre-release	Apr 5, 2026
1.0.0-alpha1	Pre-release	Apr 5, 2026