
The evaluation framework module, part of the AI ecosystem.

An evaluation is a test of how good an AI system is at achieving a specific goal. This module aims to be a framework that helps site builders quickly put together evaluations for a variety of goals, approaches, AI systems and models, and compare one approach against another.

This can include roughly evaluating the quality of content produced, evaluating how good a chatbot's answer is, or evaluating whether or not an AI decision matched the human-approved one.

This will include tools for conducting initial evaluations as well as spot checks to regularly verify how well the AI system is holding up. It will provide tools for humans to see an overview, but also tools to drill down and see where errors happened in a specific run of the AI system (such as a specific chat message). This can help with improving the prompts. Evaluations can be conducted by humans, but AI can also be used to evaluate itself, helping a human quickly get to the heart of any issues within the system.

This is not the same as the standard AI evaluation systems that exist in open source: this framework is for evaluating the whole system, not just a specific prompt (though we think this module could be used to produce those kinds of evaluations, as well as training data for fine-tuning). It will also provide a tool to easily export and share evaluations to assist with debugging.
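To make the export-and-share idea concrete, here is a minimal sketch of writing evaluation runs out as JSON Lines. The field names and file layout are illustrative assumptions only; the module's actual export format is not defined here.

```python
# Hypothetical sketch: exporting evaluation runs as JSON Lines so they
# can be shared for debugging or analysed off-site. All field names
# below are assumptions, not the module's real schema.
import json

runs = [
    {"goal": "comment_relevance", "prompt": "Is this comment on-topic?",
     "response": "Yes", "human_verdict": "correct"},
]

def export_jsonl(records, path):
    """Write one evaluation record per line as JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec) + "\n")

export_jsonl(runs, "evaluations.jsonl")
```

One record per line keeps the export append-only and easy to merge across sites, which suits the "collected and analysed as a group" use case described below in Outputs.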

Currently the focus is on Drupal CMS, but this will be expanded to all areas of the AI module.

We don't want to build everything in Drupal and will explore external evaluation libraries where possible. However, because Drupal holds content in a consistent manner, we will likely find that many things need to be implemented in Drupal and then exported to external libraries.

This will support a number of different goals and approaches to evaluations:

Goals

  • AI aims to make decisions that exactly match what a human would make, where the answer is clear-cut. Such as deciding if a comment is relevant to the content.
  • AI aims to write creative content that "feels" useful. Such as helping ideate or writing alt text for images.
  • AI aims to achieve human tasks where, even if it doesn't get it 100% right, it's still useful and makes a user happy. Such as helping a marketer create Drupal configuration like views.

Outputs

  • Reports that show the overall success of an AI system for a Goal.
  • Ability to compare one AI system to another for the same goal (such as a change in prompt or configuration), i.e. A/B testing.
  • Ability to drill down and look at specific prompts and responses to see where things have gone wrong to help with debugging and fixing the system.
  • Exports to allow evaluations from one site to be collected and analysed as a group, or even exported as evaluations that can be submitted back to the models to assist with model training.
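The A/B comparison above can be sketched in a few lines: given clear-cut pass/fail evaluations of two system variants on the same goal, compute a per-system success rate. The record structure and system names here are illustrative assumptions, not the module's actual data model.

```python
# Hypothetical sketch: comparing two AI systems (e.g. two prompt
# variants) on the same goal using clear-cut pass/fail evaluations.
from collections import defaultdict

evaluations = [
    {"system": "prompt_v1", "case": "comment-1", "passed": True},
    {"system": "prompt_v1", "case": "comment-2", "passed": False},
    {"system": "prompt_v2", "case": "comment-1", "passed": True},
    {"system": "prompt_v2", "case": "comment-2", "passed": True},
]

def success_rates(records):
    """Per-system pass rate across evaluation records."""
    totals, passes = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["system"]] += 1
        passes[r["system"]] += r["passed"]  # True counts as 1
    return {s: passes[s] / totals[s] for s in totals}

print(success_rates(evaluations))
# {'prompt_v1': 0.5, 'prompt_v2': 1.0}
```

A real report would aggregate many more cases per system, but the comparison reduces to the same per-system rate.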

Approaches

  • Evaluations may be clear-cut "correct or incorrect" and can have simple statistics of how often the AI gets it right.
  • Evaluations may be based on fuzzier data, such as a scale from 1 to 10 based on the quality a user perceives, using a number of metrics (how creative is this content?).
  • Evaluations may be based on fuzzier data, but at a scale where it needs to be reduced to something simple, such as a thumbs up or thumbs down on a chatbot.
  • Evaluations may allow people to write up unstructured descriptions of how the LLM performed and write notes, which can in turn be analysed with AI or exposed in the reports.
  • Evaluations can be for a specific prompt or a whole system. For example, one step in the AI chain might use OCR to extract text from an image; one system may use one OCR provider and another a different one. The evaluations framework will allow you to compare changes in the whole system, including evaluating the effectiveness of the components that are not AI.
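The "reduce fuzzy scores to something simple" approach above can be sketched as a threshold function: collapse 1-10 quality scores into thumbs up/down and report the up-rate. The threshold value is an illustrative assumption, not part of the module.

```python
# Hypothetical sketch: reducing fuzzy 1-10 quality scores to a simple
# thumbs-up/down signal. The threshold of 7 is an assumed cut-off.

def to_thumbs(score, threshold=7):
    """Collapse a 1-10 quality score to True (up) / False (down)."""
    return score >= threshold

scores = [9, 4, 7, 6, 10]
thumbs = [to_thumbs(s) for s in scores]
rate = sum(thumbs) / len(thumbs)
print(f"thumbs-up rate: {rate:.0%}")  # thumbs-up rate: 60%
```

The same reduction lets fuzzy per-metric scores feed into the simple pass-rate statistics used for clear-cut evaluations.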

Activity

  • Total releases: 1
  • First release: Apr 2026
  • Latest release: 18 hours ago
  • Stability: 100% stable

Releases

Version Type Release date
0.1.0 Stable Apr 20, 2026