Tutorial | LLM evaluation#

Dataiku capabilities like the LLM Mesh and prompt studios enable you to test and select the most efficient combination of LLMs and prompts for a Gen AI application. However, evaluating the performance of a Gen AI application as a whole requires a more complex pipeline.

Let’s see how the Evaluate LLM recipe can become a cornerstone of your LLMOps strategy!

Watch the featurette

Get started#

Objectives#

In this tutorial, you will:

  • Use the Evaluate LLM recipe to assess the performance of a Gen AI application.

  • Incorporate this recipe’s output (a model evaluation store) into Dataiku’s existing model lifecycle toolbox.

  • Recognize how this recipe fits into a broader LLMOps strategy.

Prerequisites#

  • Dataiku 13.2.0 or later.

  • Advanced LLM Mesh license flag activated.

  • Full Designer user profile.

  • LLM connection(s) for computing embeddings and requesting prompts. See the reference documentation for information on LLM connections.

  • A Python 3.8+ code environment containing the required packages for LLM evaluation. Use the preset named Evaluation of Large Language Models.

Create the project#

  1. From the Dataiku Design homepage, click + New Project > DSS tutorials > Gen AI Practitioner > LLM Evaluation.

  2. From the project homepage, click Go to Flow (or g + f).

Note

You can also download the starter project from this website and import it as a zip file.

Use case summary#

Those familiar with model evaluation and monitoring in Dataiku will be comfortable using the Evaluate recipe and its primary output, a model evaluation store (MES). The Evaluate recipe requires both a dataset and a saved model as input.

On the other hand, LLM evaluation, as discussed here, is a different kind of exercise. The goal is not to evaluate the quality of the LLM itself, which is a task better suited to a prompt studio. Rather, the goal is to evaluate a Gen AI application as a whole. These Gen AI applications may involve multiple LLMs, data transformations, and other traditional analytics and ML components.

Accordingly, the Evaluate LLM recipe takes only a dataset as input. Therefore, the starter project includes only one dataset: questions_answers_cleaned. You can imagine this dataset as the output of a typical LLM Flow, such as the one shown below.

Dataiku screenshot of the Flow of a Gen AI application.

Tip

See the Generative AI and Large Language Models (LLMs) section of the Knowledge Base, and in particular Tutorial | Use the Retrieval Augmented Generation (RAG) approach for question-answering, for opportunities to build a Gen AI pipeline like the one shown above.

The exact requirements for the input dataset depend on the type of LLM evaluation task at hand. In some cases, you may only have an input column, whereas in other cases you may also have columns containing the ground truth or context if using a RAG approach.

Let’s take a closer look at the questions_answers_cleaned dataset:

| Column | Contents |
| --- | --- |
| question | An actual question that could have been asked through a chatbot interface (such as Dataiku Answers). |
| reference_answer | A response to the question written by a human or expert; in other words, the ground truth. |
| context | The content of the Knowledge Bank added by a RAG model (for example, through a prompt recipe). |
| answer | The actual response to a question from an LLM. |

  1. From the Flow, open the questions_answers_cleaned dataset.

  2. Take a moment to explore these four columns.

Dataiku screenshot of the starting data for the Evaluate LLM tutorial.
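If you prefer to inspect the data programmatically, you can load the dataset into a pandas DataFrame from a Python notebook in the project. This is purely optional; the tutorial itself requires no code. A minimal sketch, assuming the dataset name used in this tutorial:

```python
import dataiku

# Load the tutorial's input dataset into a pandas DataFrame
df = dataiku.Dataset("questions_answers_cleaned").get_dataframe()

# Expect the four columns described above:
# question, reference_answer, context, answer
print(df.columns.tolist())
print(df.head())
```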

See also

Real-world use cases can provide inspiration for the two general types of datasets that can be used as input to the Evaluate LLM recipe. Consider the following:

  • A Gen AI application that emails an AI summary of a business field — such as one created by Ørsted. The output dataset produced by this kind of batch workflow can be provided as input to the Evaluate LLM recipe.

  • A chatbot that uses a RAG LLM to provide contextualized answers to your collaborators — such as one created by LG Chem. A sample reference dataset of inputs passed through your pipeline to simulate an interaction with users can be provided as input to the Evaluate LLM recipe.

Create an Evaluate LLM recipe#

Although the Evaluate LLM recipe only takes one dataset as input, its three outputs are the same as those of the Evaluate recipe.

  1. Open the Actions panel of the questions_answers_cleaned dataset.

  2. Select the Evaluate LLM recipe from the LLM recipes section.

  3. Set the following three outputs:

    • qa_output as the output dataset

    • qa_metrics as the metrics dataset

    • Q&A Evaluation as the evaluation store

  4. Click Create Recipe.

Dataiku screenshot of the dialog to create an LLM recipe.

Configure the Evaluate LLM recipe#

Let’s walk through the parameters needed to configure the recipe.

Define the evaluation task#

The first step is to define the type of evaluation task. This tutorial walks through a question answering task. Additional options include summarization, translation, or a non-prescriptive “other”.

The choice of task guides the recipe’s configuration in terms of required input columns and recommended metrics. For example, to compute faithfulness, you need to have columns for the answer and the context. To compute answer correctness, you need columns for the answer and the ground truth. Moreover, a summarization task suggests ROUGE; a translation task suggests BLEU.

  1. In the Input Dataset tile, select Question Answering as the Task.

  2. Select the columns for the fields indicated in the table below:

    | Field | Column |
    | --- | --- |
    | Input column | question |
    | Output column | answer |
    | Ground truth column | reference_answer |
    | Context column | context |

  3. As the dataset is quite small, sampling can remain turned off.

Select out-of-the-box evaluation metrics#

Like the Evaluate recipe, the Evaluate LLM recipe lets you define alternatives to the default evaluation ID, name, and labels. We’ll return to labels later.

Next, you can select the metrics to be computed. Which metrics are suitable depends on the task at hand. Some of these metrics are more algorithmic (such as ROUGE or BLEU), while others use a technique called LLM-as-a-judge. This second approach is relatively new and relies on asking an LLM a specifically crafted question so that it assesses various aspects of performance.
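The recipe computes all of these metrics for you. Purely to illustrate what an algorithmic metric measures, here is a minimal standalone sketch of a BLEU score using NLTK (an assumption for illustration only; NLTK is not required by the recipe):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Algorithmic metrics like BLEU score token overlap between a candidate
# text and one or more references, with no LLM involved.
reference = "The Evaluate LLM recipe takes a single dataset as input.".split()
candidate = "The Evaluate LLM recipe only takes one dataset as input.".split()

score = sentence_bleu(
    [reference],  # list of reference token lists
    candidate,
    smoothing_function=SmoothingFunction().method1,  # avoids zero scores on short texts
)
print(f"BLEU: {score:.3f}")
```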

See also

More details on these out-of-the-box metrics can be found in the reference documentation on Evaluating Large Language Models and Generative AI pipelines.

For LLM-as-a-judge metrics, you’ll also need to choose an LLM connection for an embedding and/or a completion LLM.

  • The embedding model is required for all metrics that depend on the distance between two strings, such as answer correctness or faithfulness.

  • The completion LLM is required for all LLM-as-a-judge metrics (and so is optional if you do not compute any of those).
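To make the distinction concrete, the sketch below shows what these two kinds of calls look like through the LLM Mesh Python API. The LLM IDs are placeholders for your own connections, and the judge prompt is only illustrative; the recipe builds its own prompts and handles these calls internally.

```python
import dataiku

client = dataiku.api_client()
project = client.get_default_project()

# Completion LLM: an LLM-as-a-judge style call (illustrative prompt only)
judge = project.get_llm("your-connection:your-chat-model")  # placeholder LLM ID
completion = judge.new_completion()
completion.with_message(
    "Rate from 0 to 1 how faithful the following answer is to the context.\n"
    "Context: ...\nAnswer: ..."
)
print(completion.execute().text)

# Embedding LLM: used by distance-based metrics to compare two strings
embedder = project.get_llm("your-connection:your-embedding-model")  # placeholder LLM ID
embeddings = embedder.new_embeddings()
embeddings.add_text("A reference answer written by an expert.")
embeddings.add_text("The answer generated by the application.")
vectors = embeddings.execute().get_embeddings()
```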

Let’s see how this works.

  1. Click Select Computable Metrics, and observe how most metrics are not selected. This is because many of these metrics require an embedding LLM and/or a completion LLM.

  2. If available, select your Embedding LLM and Completion LLM.

  3. Click Select Computable Metrics once again, and observe how metrics like answer correctness are now included.

Dataiku screenshot of the outputs configuration for an Evaluate LLM recipe.

Optional: Define custom evaluation metrics#

In addition to the out-of-the-box metrics, you can also code your own custom metrics in Python. A custom metric can return either a float, which is displayed alongside the other metrics in the model evaluation store, or an array of values, which is helpful for row-by-row analysis.

  1. In the Custom Metrics section, click + Add Custom Metric.

  2. Read the default docstring to understand the function’s input parameters, and then delete the default code.

  3. Click Code Samples.

  4. Select Simple custom LLM metric.

  5. Click + Insert.

  6. Click the pencil icon to rename the metric Question word count.

  7. Click Save.

Dataiku screenshot of a custom metric in the Evaluate LLM recipe.

Tip

When writing a custom metric, first develop the code in a notebook. You only need the input evaluation dataset and, potentially, the embedding and completion LLMs. When ready, uncheck all other metrics to run the recipe faster and validate the custom metric’s computation. Note that a failure in the custom metric’s computation only raises a warning and does not fail the entire recipe; the other metrics will still be computed and added to the evaluation.
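For example, a notebook prototype of the Question word count idea might look like the sketch below. The exact function signature expected by the recipe is described in its default docstring and code samples; this standalone version is only for developing and checking the logic.

```python
import dataiku
import pandas as pd

# Prototype of a per-row custom metric returned as an array of values
df = dataiku.Dataset("questions_answers_cleaned").get_dataframe()

def question_word_count(questions: pd.Series) -> list:
    """Return the number of words in each question."""
    return [len(str(q).split()) for q in questions]

# Inspect the first few values before porting the logic into the recipe
print(question_word_count(df["question"])[:5])
```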

Run the evaluation#

When you are satisfied with the metrics to be computed, you can execute the recipe.

  1. In the Python environment section, confirm that you have selected a compatible code environment as mentioned in the prerequisites.

  2. Click Run.

Analyze the results of an LLM evaluation#

Once the recipe has finished executing, you can explore the outputs.

Explore the model evaluation store#

The most important output is the model evaluation store.

  1. Open the Q&A Evaluation MES to find the selected LLM evaluation metrics presented in a familiar format.

  2. Click Open to view a summary of the model evaluation.

  3. Go to the Row-by-row analysis panel.

  4. Explore some of the individual row output, adding and removing columns as needed.

Dataiku screenshot of row by row analysis of an LLM evaluation.

Explore the output and metrics datasets#

Similar to the standard Evaluate recipe, the Evaluate LLM recipe can also output its results in a dataset format:

  • One dataset containing the input dataset plus the computed metrics.

  • One dataset containing just the metrics.

These datasets can be particularly useful as inputs to other processes such as webapps and/or dashboards.

  1. Open the qa_output dataset, and observe the addition of the metrics to the input dataset’s schema.

  2. Open the qa_metrics dataset, and find the same metrics without the input data, identified by the timestamp of the model evaluation.

Dataiku screenshot of the output datasets to an Evaluate LLM recipe.
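For instance, you could chart a metric across evaluations from a notebook or webapp. The sketch below assumes hypothetical column names in qa_metrics (such as a timestamp column and an answer correctness column); check the actual schema in your project before reusing it.

```python
import dataiku
import matplotlib.pyplot as plt

# Track a metric across evaluations using the metrics dataset
df = dataiku.Dataset("qa_metrics").get_dataframe()

# Column names below are assumptions; adjust them to the real schema
df = df.sort_values("timestamp")
df.plot(x="timestamp", y="answer_correctness", marker="o")
plt.title("Answer correctness across evaluations")
plt.show()
```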

Compare LLM evaluations#

Of course, a single LLM evaluation is not very interesting on its own. Let’s produce a second evaluation as a point of comparison. Then we can create a model comparison, just as we would for model evaluations or models in the Lab.

  1. Open the Evaluate LLM recipe.

  2. Make an arbitrary change to the recipe, such as selecting a different LLM if one is available. You can also go to the Advanced tab of the recipe and change a parameter, such as the temperature.

  3. After making a change, click Save and then Run to execute the recipe a second time.

  4. Open the Q&A Evaluation MES to observe the results of the second evaluation in relation to the first.

  5. Select all evaluations by checking the box to the right of the Actions button.

  6. Click the Actions button, and select Compare.

  7. In the dialog, click Compare to create a new comparison.

Dataiku screenshot of the dialog for an LLM comparison.

Then you can dive into the results, attempting to understand why some rows may have poor metrics.

  1. View the Summary of the comparison.

  2. Navigate to the Row by row comparison to dive deeper into individual results.

Dataiku screenshot of row by row comparison of an LLM evaluation.

Label an LLM evaluation with metadata#

The name of an LLM evaluation (like a model evaluation) defaults to the timestamp at which it was created. Without enough metadata, it can be difficult to keep track of what these evaluations actually refer to.

Labels, in the format of <domain>:<key>, can be added either before the evaluation in the Evaluate LLM recipe or afterwards in the model evaluation store. Let’s demonstrate the former.

  1. Reopen the Evaluate LLM recipe.

  2. Optionally, make another arbitrary change, such as a temperature change in the recipe’s Advanced tab.

  3. In the Outputs tile of the recipe’s Settings tab, click + Add Label. Possible examples are provided in the screenshot below.

  4. Click Run to produce another LLM evaluation.

Dataiku screenshot of the labels section of an Evaluate LLM recipe.

Now we can display these labels in the model evaluation store.

  1. When the recipe is finished running, open the Q&A Evaluation MES.

  2. Open the Display also dropdown.

  3. Select the new labels.

  4. Confirm they appear as columns in the MES.

Dataiku screenshot of a MES with labels displayed.

Tip

You can also open previous evaluations, and add labels in the Metadata section of the Summary panel.

Automate the LLM lifecycle#

As you may have guessed, because the Evaluate LLM recipe outputs a model evaluation store, it can be incorporated into Dataiku’s standard automation toolbox of checks and scenarios.

Tip

If you are unfamiliar with features like checks and scenarios, start with the Academy course on Data Quality & Automation.

  1. Open the Q&A Evaluation MES.

  2. Navigate to the Settings tab.

  3. Go to the Status checks subtab.

  4. Click Metric Value is in a Numeric Range.

  5. Name the check Answer relevancy.

  6. Select Answer relevancy as the metric to check.

  7. For most metrics, you’ll want to set minimum and soft minimum values (such as 0.5 and 0.8, respectively) so that the check raises an error or a warning according to your use case.

  8. Click Check to test it.

  9. Click Save.

Dataiku screenshot of a check on a MES.
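The logic of such a numeric-range check roughly follows the sketch below. Dataiku evaluates this for you inside the MES, so the function is purely illustrative of how the minimum and soft minimum thresholds map to outcomes.

```python
# Illustrative only: how a minimum / soft minimum pair maps a metric value
# to a check outcome. Dataiku's built-in check does this for you.
def answer_relevancy_check(value: float, minimum: float = 0.5, soft_minimum: float = 0.8) -> str:
    if value < minimum:
        return "ERROR"    # below the hard minimum: the check fails
    if value < soft_minimum:
        return "WARNING"  # within the soft range: the check warns
    return "OK"

print(answer_relevancy_check(0.72))  # WARNING
```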

Once you have a check on the metric of answer relevancy, you can create a scenario to send an alert if the metric deviates from the expected range.

Dataiku screenshot of a scenario for running a check after building a MES.

Important

This tutorial has focused on LLM evaluation, but there is another similar aspect: monitoring. The model evaluation store can be used both for evaluation and monitoring.

For the latter, the input dataset (in this case, questions_answers_cleaned) would be the logging dataset from your production platform. From this dataset, you can select the metrics relevant to your usage and receive notifications using the same kind of scenario shown here.

To learn more about working with production environments in Dataiku, see the MLOps Practitioner learning path.

What’s next?#

Congratulations! You’ve seen not only how the Evaluate LLM recipe can assess the performance of a Gen AI application, but also how it fits into Dataiku’s broader ecosystem for MLOps, or in this case, LLMOps.

See also

Consult the reference documentation for more information on Generative AI and LLM Mesh and Evaluating Large Language Models and Generative AI pipelines.