Tutorial | Summarize text with generative AI#

Dataiku lets you summarize text with large language models (LLMs) through a visual recipe named Summarize text. You provide an input dataset, choose the column to summarize, and optionally set the language in which the summary is written.

This tutorial guides you through all the steps necessary to use the summarization recipe.

Get started#

Objectives#

In this tutorial, you will:

  • Create and configure a Summarize text recipe to summarize the content of some articles.

  • Explore the output dataset.

Prerequisites#

To complete this tutorial, you’ll need:

  • A Dataiku instance (version 12.3 or above). Dataiku Cloud is compatible.

  • A connection to at least one supported generative AI model. Your administrator must configure it beforehand in the Administration panel > Connections > New connection > LLM Mesh.

    Supported connections include providers such as OpenAI, Hugging Face, and Cohere.

Create the project#

This tutorial uses a dataset of articles from the Dataiku Knowledge Base. We’ll work with a subset of 10 articles to reduce computation cost.

To create the project:

  1. From the Dataiku Design homepage, click + New Project > DSS tutorials > ML Practitioner > LLM - Summarization.

  2. From the project homepage, click Go to Flow (or g + f).

Note

You can also download the starter project from this website and import it as a zip file.

Create the summarization recipe#

The first step is to attach the summarization recipe to the input dataset. In this tutorial, the dataset we’ll use includes the URL and text of each knowledge base article.

Let’s use the Summarize text recipe to summarize the content of the cleaned_text column.

To do so:

  1. From the Flow, select the dataiku_knowledge_base_sample dataset and in the Actions tab, click Summarize text under the LLM Recipes section.

  2. In the New summarization recipe window, keep the selected input dataset and name the output dataset dataiku_knowledge_base_sample_summarized.

  3. Click Create Recipe. The recipe settings page opens.

Configure the summarization recipe#

Now, let’s see how to configure the recipe.

  1. In the LLM option, select the connection to the large language model of your choice.

    Important

    If the list is empty, ask your administrator to create a connection for you (see the prerequisites above).

  2. In the Input column option, select cleaned_text, the column that contains the text of each article to summarize.

  3. Optionally, enter English in the Summary language field to specify the language in which the summary must be written. Note that you can only specify one language at a time.

Note

You can set a specific length of the summary. To do so:

  • Check the Set desired output length box to display the length options.

  • Enter an integer in the Desired summary length field.

  • Select the length unit (words or sentences).
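To make the effect of these settings more concrete, here is a minimal Python sketch of how a language and a desired length could be folded into a summarization instruction. It is only an illustration: the function name build_summary_prompt and the wording of the instruction are assumptions, and the prompt the Summarize text recipe actually sends to the LLM is not documented in this tutorial.

```python
from typing import Optional


def build_summary_prompt(
    text: str,
    language: Optional[str] = None,
    length: Optional[int] = None,
    unit: str = "sentences",
) -> str:
    """Assemble a summarization instruction from recipe-like settings.

    Hypothetical helper: it mirrors the options shown in the recipe
    settings page, not the prompt Dataiku really builds.
    """
    if unit not in ("words", "sentences"):
        raise ValueError("unit must be 'words' or 'sentences'")
    if length is not None and length <= 0:
        raise ValueError("length must be a positive integer")

    parts = ["Summarize the following text."]
    if language:
        # Only one language can be requested at a time.
        parts.append(f"Write the summary in {language}.")
    if length is not None:
        parts.append(f"Keep the summary to about {length} {unit}.")

    instruction = " ".join(parts)
    return f"{instruction}\n\n{text}"


# Made-up input text, used only to show the resulting prompt.
prompt = build_summary_prompt(
    "Dataiku recipes transform datasets in the Flow.",
    language="English",
    length=2,
    unit="sentences",
)
print(prompt)
```

Leaving language and length unset yields the bare instruction, which corresponds to running the recipe with only the LLM and the input column selected.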

Run the recipe and explore the output#

Now that we’re all set, let’s run the recipe and explore the output dataset.

  1. Still in the recipe settings page, click Run.

  2. Once finished, click the Explore dataset dataiku_knowledge_base_sample_summarized link at the bottom of the page to open the output dataset.

Screenshot of the output dataset of the Summarize text recipe.

The recipe has added two columns to our input dataset:

  • summarized_text, which contains the summary of each article.

  • llm_error_message, which stores any error message.
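If you want to check the result programmatically, one simple approach is to separate the rows that were summarized from those that carry an error message. The sketch below uses plain Python dictionaries so that it runs anywhere; the helper split_by_error, the url column name, and the sample rows are all made up for illustration. Inside Dataiku, you would more likely load the output dataset into a dataframe first.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, Optional[str]]


def split_by_error(rows: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Split rows into (summarized, failed) based on llm_error_message.

    A row counts as failed when llm_error_message is non-empty.
    Hypothetical helper, not part of any Dataiku API.
    """
    ok: List[Row] = []
    failed: List[Row] = []
    for row in rows:
        message = (row.get("llm_error_message") or "").strip()
        if message:
            failed.append(row)
        else:
            ok.append(row)
    return ok, failed


# Made-up rows shaped like the recipe output; "url" is an assumed column name.
sample_rows: List[Row] = [
    {
        "url": "https://example.com/a",
        "summarized_text": "A short summary.",
        "llm_error_message": "",
    },
    {
        "url": "https://example.com/b",
        "summarized_text": "",
        "llm_error_message": "Request timed out",
    },
]

ok_rows, failed_rows = split_by_error(sample_rows)
print(f"{len(ok_rows)} summarized, {len(failed_rows)} failed")
```

Rows that end up in the failed list are the ones worth inspecting or re-running.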

What’s next?#

Congratulations! You have summarized the content of some long articles using the Summarize text recipe.

You can explore other LLM features such as the classification recipe to categorize your data into different classes.