Tutorial | Classify text with generative AI#

Dataiku lets you classify your data with large language models (LLMs) through a visual recipe named Classify text. Unlike supervised learning, you don’t have to manually label rows of your dataset to train a model. Instead, depending on your use case, you either use a task with preset classes or give the model a list of classes you define, and the LLM classifies each row of your dataset.

As classification can use either task-based classes or classes specified by the user, this tutorial will guide you through two use cases:

  • Analyzing the sentiment of product reviews.

  • Classifying news articles using user-specified classes.

Get started#

Objectives#

In this tutorial, you will use LLM classification with:

  • Model-defined classes to classify product reviews into three classes: positive, negative or neutral.

  • User-defined classes to classify some Reuters articles into custom classes, based on the headlines.

Important

Using an LLM incurs a cost that depends on the model you select. To limit the cost, this tutorial uses the GPT 3.5 model from OpenAI and small datasets.
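To get a feel for how dataset size drives the cost, here is a minimal back-of-the-envelope sketch. The token count per row and the price per 1,000 tokens are hypothetical placeholders, not actual provider pricing; check your provider's current rates before relying on any figure.

```python
def estimate_cost(n_rows, avg_tokens_per_row, price_per_1k_tokens):
    """Rough cost of sending every row of a dataset to an LLM once."""
    total_tokens = n_rows * avg_tokens_per_row
    return total_tokens / 1000 * price_per_1k_tokens


# Hypothetical figures: 1,000 rows, about 60 tokens per row (prompt plus
# answer), and an assumed price of $0.002 per 1,000 tokens.
cost = estimate_cost(1000, 60, 0.002)
print(f"Estimated cost: ${cost:.2f}")
```

Because the estimate scales linearly with the number of rows, working on a small sample first is a cheap way to validate a recipe before running it on a full dataset.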

Prerequisites#

To complete this tutorial, you’ll need:

  • A Dataiku instance (version 12.3 or above). Dataiku Cloud is compatible.

  • A connection to at least one supported generative AI model. Your administrator must configure it beforehand in the Administration panel > Connections > New connection > LLM Mesh.

    Supported connections include providers such as OpenAI, Hugging Face, and Cohere.

Create the project#

This tutorial uses a dataset of articles from the Reuters news agency that is publicly available on Kaggle. We’ll work with a subset of 1,000 articles to reduce computation cost.

To create the project:

  1. From the Dataiku Design homepage, click + New Project > DSS tutorials > ML Practitioner > LLM - Classification.

  2. From the project homepage, click Go to Flow (or g + f).

As shown below, the project includes two Flow zones, one for each use case.

Screenshot of the initial Flow for the classification tutorial.

Note

You can also download the starter project from this website and import it as a zip file.

Analyze the sentiments in product reviews#

Let’s start by classifying product reviews using sentiment analysis. In the end, every review will be tagged as positive, negative, or neutral.

Create the classification recipe#

The first step is to create the classification recipe from your dataset.

To do so:

  1. From the Flow, in the Classification using task-specific classes Flow zone (in blue), select the product_review dataset.

  2. In the Actions tab of the right panel, under the LLM recipes section, click the Classify text recipe. It opens the text classification use case dialog.

  3. Click on Sentiment analysis.

  4. In the New classification with model-defined classes recipe dialog:

    • Keep the input dataset.

    • Name the output dataset product_review_sentimental_analysis.

    • Store it in the location of your choice.

  5. Click Create recipe. This opens the settings page of the classification recipe.

Configure the sentiment analysis recipe#

Now, you must configure the recipe.

  1. In the LLM dropdown, select the LLM you want to use for sentiment analysis. In this tutorial, we use the GPT 3.5 model. Remember that the dropdown only lists connections that your administrator has previously configured in the Administration panel > Connections > New connection > LLM Mesh.

  2. In the Input column field, select the text column. It contains the customer reviews we want to analyze.

  3. In the Task field, select Sentiment analysis (+/=/-).

  4. Leave the output mode as is.

Note

If you select an LLM other than GPT 3.5, the options may vary.

Run the recipe and explore the output#

Now that we’re all set, let’s run the recipe and explore the output dataset.

  1. Still in the recipe settings page, click Run.

  2. Once finished, click the Explore dataset product_review_sentimental_analysis link at the bottom of the page to open the output dataset.

Screenshot of the output dataset.

The prediction is stored in the llm_prediction column. As we kept the default output mode (Output the most likely class) when configuring the recipe, the column contains a single class for each row: the one with the highest score.

If you had set the output mode to Output all classes, the llm_prediction column would instead contain JSON with the score for each class.

Screenshot of the LLM prediction when outputting all classes.
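If you work with the Output all classes mode, you may want to recover the most likely class from the JSON yourself. The sketch below assumes the cell holds a flat JSON object mapping class names to scores. The sample value, its key names, and the helper most_likely_class are hypothetical; the exact layout produced by the recipe may differ.

```python
import json

# Hypothetical cell value from the llm_prediction column.
cell = '{"positive": 0.08, "neutral": 0.15, "negative": 0.77}'


def most_likely_class(raw):
    """Return the class name with the highest score in a JSON score object."""
    scores = json.loads(raw)
    return max(scores, key=scores.get)


label = most_likely_class(cell)
print(label)
```

Keeping all scores is useful when you want to apply your own threshold, for example to flag rows where the top two classes are close.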

Classify news articles using your own classes#

Now, let’s classify news articles using classes we’ll provide to the model.

Create the classification recipe#

The first step is to create the classification recipe from your dataset.

To do so:

  1. From the Flow, in the Classification using user-specified classes Flow zone (in red), select the reuters_headlines dataset.

  2. In the Actions tab of the right panel, under the LLM recipes section, click the Classify text recipe. It opens the text classification use case dialog.

  3. Click on LLM classification.

  4. In the New classification with user-defined classes recipe dialog:

    • Keep the input dataset.

    • Name the output dataset reuters_classified.

    • Store it in the location of your choice.

  5. Click Create recipe. This opens the settings page of the classification recipe.

Configure the classification recipe#

Now, you must configure the recipe.

  1. In the LLM dropdown, select the LLM you want to use for classification. In this tutorial, we use the GPT 3.5 model. Remember that the dropdown only lists connections that your administrator has previously configured in the Administration panel > Connections > New connection > LLM Mesh.

  2. In the Input column field, select the Headlines column.

  3. In the Classes section, enter the following classes:

    • Business

    • Technology

    • Sports

    • Science

    • Health

    • Politics

  4. Leave the rest as is.
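To build intuition for what classifying with user-defined classes involves, the sketch below shows one way a list of classes could be turned into an instruction for an LLM, and how a free-text answer could be mapped back to a known class. This is an illustration only, not the prompt or logic the Classify text recipe actually uses; build_prompt and normalize_answer are hypothetical helpers.

```python
CLASSES = ["Business", "Technology", "Sports", "Science", "Health", "Politics"]


def build_prompt(headline, classes):
    """Assemble a zero-shot classification instruction for one headline."""
    options = ", ".join(classes)
    return (
        "Classify the following news headline into exactly one of these "
        f"classes: {options}.\n"
        f"Headline: {headline}\n"
        "Answer with the class name only."
    )


def normalize_answer(answer, classes):
    """Map a raw model answer to a known class, or None if nothing matches."""
    cleaned = answer.strip().strip(".").lower()
    for name in classes:
        if name.lower() == cleaned:
            return name
    return None


print(build_prompt("Central bank holds interest rates steady", CLASSES))
print(normalize_answer(" sports.\n", CLASSES))
```

Returning None for an unexpected answer makes it easy to spot rows where the model strayed from the list, which is one reason to keep class names short and unambiguous.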

Note

If you select an LLM other than GPT 3.5, the options may vary.

Run the recipe and explore the output#

Now that we’re all set, let’s run the recipe and explore the output dataset.

  1. Still in the recipe settings page, click Run.

  2. Once finished, click the Explore dataset reuters_classified link at the bottom of the page to open the output dataset.

Screenshot of the output dataset.

The prediction is stored in the predicted_class column.

If we had enabled the Explain class choice option when configuring the recipe, the predicted_class_explanation column would explain why the model selected the predicted class.

Screenshot of the LLM prediction with explanation of class selection.
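A quick sanity check after running the recipe is to look at how the rows are spread across the classes. The sketch below counts a small, made-up sample of predicted_class values; the list is hypothetical and stands in for the real column.

```python
from collections import Counter

# Hypothetical sample of values from the predicted_class column.
predictions = ["Business", "Politics", "Business", "Health", "Sports", "Business"]

counts = Counter(predictions)
for name, n in counts.most_common():
    share = n / len(predictions)
    print(f"{name}: {n} ({share:.0%})")
```

In a Dataiku Python notebook, you could load the real column with dataiku.Dataset("reuters_classified").get_dataframe() and count its predicted_class values in the same way. A class that never appears, or one that absorbs almost every row, is a hint that the class list may need rewording.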

What’s next?#

Congratulations! You have classified some product reviews and news articles using the Classify text recipe.

You can explore other LLM features such as the Summarize text recipe, which condenses long articles into shorter summaries.