Tutorial | Classify text with Generative AI#
Dataiku lets you use Large Language Models (LLMs) to classify your data with a visual recipe named Classify text. This automates the process, so you don’t have to manually label the rows of your dataset as you would in supervised learning. Instead, depending on your use case, you can use a model with preset classes or tell a model to use a list of classes you define, and the LLM will classify each row of your dataset.
As classification can use either task-based classes or classes specified by the user, this tutorial will guide you through two use cases:
Analyzing the sentiments in some product reviews using sentiment analysis.
Classifying news articles using user-specified classes.
Get started#
Objectives#
In this tutorial, you will use LLM classification with:
Model-defined classes to classify product reviews into three classes: positive, negative or neutral.
User-defined classes to classify some Reuters articles into custom classes, based on the headlines.
Prerequisites#
To complete this tutorial, you’ll need:
Dataiku 12.3 or later.
An Advanced Analytics Designer or Full Designer user profile.
A connection to at least one supported Generative AI model. Your administrator must configure it beforehand in the Administration panel > Connections > New connection > LLM Mesh. Supported connections include providers such as OpenAI, Hugging Face, and Cohere.
Important
Using an LLM has a cost, and the cost depends on the LLM you select. In this tutorial, to limit the cost, we’re using the Chat GPT 3.5 model from OpenAI and small datasets. Feel free to select another LLM, as you can follow this tutorial regardless of your LLM choice.
Tip
You do not need previous experience with Large Language Models (LLMs), though it would be useful to read the article Concept | Classify text recipe before completing this tutorial.
Create the project#
From the Dataiku Design homepage, click + New Project.
Select Learning projects.
Search for and select LLM - Classification.
Click Install.
From the project homepage, click Go to Flow (or g + f).
From the Dataiku Design homepage, click + New Project.
Select DSS tutorials.
Filter by Gen AI Practitioner.
Select LLM - Classification.
From the project homepage, click Go to Flow (or g + f).
Note
You can also download the starter project from this website and import it as a zip file.
This tutorial uses a dataset of articles from the Reuters news agency that is publicly available on Kaggle. We’ll work with a subset of 1,000 articles to reduce computation cost.
The project includes two Flow zones, one for each use case.
Analyze the sentiments in product reviews#
Let’s start by classifying product reviews using sentiment analysis. In the end, each review will be tagged as positive, negative, or neutral.
Create the classification recipe#
The first step is to create the classification recipe from your dataset.
To do so:
From the Flow, in the Classification using task-specific classes Flow zone (in blue), select the product_reviews dataset.
In the Actions tab of the right panel, under the LLM recipes section, click the Classify text recipe. It opens the text classification use case dialog.
Click on Sentiment analysis.
In the New classification with model-defined classes recipe dialog:
Keep the input dataset.
Name the output dataset product_review_sentimental_analysis.
Store it in the location of your choice.
Click Create recipe. This opens the settings page of the classification recipe.
Configure the sentiment analysis recipe#
Now, you must configure the recipe.
In the LLM dropdown, select the LLM you want to use for sentiment analysis.
Note
In this tutorial, we use the GPT 3.5 model, but feel free to use any other LLM available to you. Remember that the dropdown only lists connections that your administrator has previously configured in the Administration panel > Connections > New connection > LLM Mesh.
In the Input column field, select the text column. It contains the customer reviews we want to analyze.
In the Task field, select Sentiment analysis (+/=/-).
Leave the output mode as is.
Note
If you select an LLM other than Chat GPT 3.5, the options may vary.
Run the recipe and explore the output#
Now that we’re all set, let’s run the recipe and explore the output dataset.
Still in the recipe settings page, click Run.
Once finished, click the Explore dataset product_review_sentimental_analysis link at the bottom of the page to open the output dataset.
The prediction is stored in the llm_prediction column. Because we kept the default output mode (Output the most likely class) when configuring the recipe, the column holds a single class for each row: the class with the highest score.
If you had set the output mode to Output all classes, the llm_prediction column would instead contain JSON with the score for each class.
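If you switch to Output all classes and later want to recover a single label per row, a small helper can do it. The sketch below is only an illustration: it assumes each llm_prediction cell is a JSON object mapping class names to numeric scores, which may not match the exact layout your model produces. The function name and the sample value are hypothetical.

```python
import json


def most_likely_class(llm_prediction: str) -> str:
    """Return the class with the highest score.

    Assumes the cell holds a JSON object such as
    {"positive": 0.1, "neutral": 0.2, "negative": 0.7}.
    The real layout written by the recipe may differ.
    """
    scores = json.loads(llm_prediction)
    # max() iterates over the class names and compares their scores.
    return max(scores, key=scores.get)


# Hypothetical cell value, made up for this example.
example = '{"positive": 0.08, "neutral": 0.12, "negative": 0.80}'
print(most_likely_class(example))
```

Inspect a few rows of your own output dataset first and adapt the parsing if the structure is different.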
Classify news articles using your own classes#
Now, let’s classify news articles using classes we’ll provide to the model.
Create the classification recipe#
The first step is to create the classification recipe from your dataset.
To do so:
From the Flow, in the Classification using user-specified classes Flow zone (in red), select the reuters_headlines dataset.
In the Actions tab of the right panel, under the LLM recipes section, click the Classify text recipe. It opens the text classification use case dialog.
Click on LLM classification.
In the New classification with user-defined classes recipe dialog:
Keep the input dataset.
Name the output dataset reuters_classified.
Store it in the location of your choice.
Click Create recipe. This opens the settings page of the classification recipe.
Configure the classification recipe#
Now, you must configure the recipe.
In the LLM dropdown, select the LLM you want to use for classification. In this tutorial, we use the GPT 3.5 model. Remember that the dropdown only lists connections that your administrator has previously configured in the Administration panel > Connections > New connection > LLM Mesh.
In the Input column field, select the Headlines column.
In the Classes section, enter the following classes:
Business
Technology
Health
Politics
Leave the rest as is.
Note
If you select an LLM other than Chat GPT 3.5, the options may vary.
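If you post-process the predictions downstream (for example in a Python recipe), it can be useful to check that every value belongs to the list of classes entered above. The following sketch is not part of the Classify text recipe and does not describe how it works internally; the helper name normalize_prediction is hypothetical, and the cleaning rules are an assumption about what stray formatting might look like.

```python
from typing import Optional, Sequence

# The four classes entered in the recipe's Classes section.
CLASSES = ["Business", "Technology", "Health", "Politics"]


def normalize_prediction(raw: str, classes: Sequence[str] = CLASSES) -> Optional[str]:
    """Map a raw label to one of the allowed classes, or None if it matches none."""
    # Drop surrounding whitespace, then stray quotes or a trailing period.
    cleaned = raw.strip().strip(".\"'").casefold()
    for name in classes:
        if cleaned == name.casefold():
            return name
    return None


print(normalize_prediction(" politics. "))
print(normalize_prediction("Sports"))
```

Rows that come back as None are good candidates for manual review, or a sign that the list of classes needs to be extended.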
Run the recipe and explore the output#
Now that we’re all set, let’s run the recipe and explore the output dataset.
Still in the recipe settings page, click Run.
Once finished, click the Explore dataset reuters_classified link at the bottom of the page to open the output dataset.
The prediction is stored in the predicted_class column.
If we had enabled the Explain class choice option when configuring the recipe, the output would also include a predicted_class_explanation column describing why the model selected the predicted class.
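To get a quick sense of how the headlines are distributed across classes, you could export the output dataset to CSV and count the values in predicted_class. The sketch below uses only the Python standard library; the function name class_counts and the sample rows are made up for illustration and are not taken from the Reuters data.

```python
import csv
import io
from collections import Counter


def class_counts(csv_text: str, column: str = "predicted_class") -> Counter:
    """Count how many rows fall into each predicted class."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column] for row in reader)


# Made-up rows standing in for an export of reuters_classified.
sample = """Headlines,predicted_class
Made-up headline A,Business
Made-up headline B,Technology
Made-up headline C,Business
"""

counts = class_counts(sample)
for name, n in counts.most_common():
    print(name, n)
```

A heavily skewed distribution, or a class that never appears, can hint that the class list does not fit the data well.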
What’s next?#
Congratulations! You have classified some product reviews and news articles using the Classify text recipe.
You can explore other LLM features such as the Summarize text recipe to summarize long articles into shorter ones.