Tutorial | Processing text with the Prompt recipe#

The Prompt recipe allows you to write prompts, or instructions to a large language model (LLM), to process datasets.

Let’s see how the recipe can be used to process text in different ways!

Get started#

Objectives#

In this tutorial, you will:

  • Use an LLM connection and the Prompt recipe to translate a dataset of product reviews.

  • Extract entities from the reviews, also using an LLM.

  • Validate results from the model.

Prerequisites#

To complete this tutorial, you will need:

  • Dataiku 12.5 or later.

  • An Advanced Analytics Designer or Full Designer user profile.

  • A connection to at least one supported generative AI model. Your administrator must configure the connection beforehand in the Administration panel > Connections > New connection > LLM Mesh. Supported providers include OpenAI, Hugging Face, and Cohere, among others.

  • No prior knowledge of working with large language models (LLMs), though it would be useful to read the article Concept | Prompt Studios and Prompt recipe before completing this tutorial.

Create the project#

This tutorial uses a dataset of product reviews from online shoppers. We’ll work with a small subset of the reviews to reduce computation cost.

To create the project:

  1. From the Dataiku Design homepage, click + New Project > DSS tutorials > ML Practitioner > LLM - Prompt Recipe.

  2. From the project homepage, click Go to Flow (or g + f).

Note

You can also download the starter project from this website and import it as a zip file.

Translate text#

Let’s say we want to translate the online product reviews from English into French.

The dataset of product reviews to process.

Create the Prompt recipe#

Start by creating a Prompt recipe to process this data.

  1. From within the dataset or with it selected in the Flow, navigate to the Actions panel.

  2. Under LLM Recipes, select the Prompt recipe.

  3. Change the name of the Output dataset to product_reviews_translated.

  4. Click Create recipe.

Steps to create the Prompt recipe.

Write the prompt#

In the recipe settings, you can write a prompt directly.

  1. Choose your LLM connection. The options depend on your connections; in this tutorial we will use the GPT-4o mini model from OpenAI.

  2. In the Prompt box, copy and paste the following text:

Translate the product reviews into French.
  3. Click Add inputs and add the following:

  • Description: Product review

  • Column: text

  4. Click Save.

  5. Run the recipe.

  6. After the recipe has finished running, click Explore dataset product_reviews_translated.

Steps to write a translation prompt.

The output dataset contains an llm_output column with the model’s results of translated product reviews.

The output dataset from a translation prompt.
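
Conceptually, the recipe applies the same prompt to every row and writes each response to llm_output. The sketch below illustrates that row-by-row idea in plain Python. It is not Dataiku's implementation: the way the prompt and the input are joined in build_request, and the fake_llm placeholder standing in for a real model call, are assumptions made for illustration, as are the sample reviews.

```python
from typing import Callable, Dict, List

PROMPT = "Translate the product reviews into French."


def build_request(prompt: str, description: str, value: str) -> str:
    """Combine the fixed prompt with one row's input (layout is an assumption)."""
    return f"{prompt}\n\n{description}: {value}"


def run_prompt(
    rows: List[Dict[str, str]],
    llm: Callable[[str], str],
    column: str = "text",
    description: str = "Product review",
) -> List[Dict[str, str]]:
    """Send one request per row and store the answer in an llm_output field."""
    output = []
    for row in rows:
        request = build_request(PROMPT, description, row[column])
        output.append({**row, "llm_output": llm(request)})
    return output


def fake_llm(request: str) -> str:
    """Hypothetical stand-in for a model call; it does not translate anything."""
    return f"[model response to a {len(request)}-character request]"


# Made-up sample rows, for illustration only.
rows = [
    {"text": "Great blender, very quiet."},
    {"text": "Stopped working after a week."},
]
results = run_prompt(rows, fake_llm)
for result in results:
    print(result["text"], "->", result["llm_output"])
```

Because each row is sent independently, the model sees one review at a time, which is why the input is described once (Product review) and mapped to a column (text).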

Now let’s write a more complex prompt!

Extract entities from text#

Entity extraction, also known as named entity recognition (NER), is a natural language processing (NLP) technique used to identify and extract important data points, such as names, locations, organizations, dates, or email addresses.

Entity extraction can help turn unstructured text into structured data that can be used more easily to search, analyze, or build models. It can also be used to help detect events and participants involved, to analyze sentiment around entities, or summarize texts.

Let’s use the Prompt recipe to extract useful information from product reviews.

Create the recipe#

We’ll create another Prompt recipe to process the same dataset (product_reviews) with a different prompt.

  1. Follow the same steps as above to create a new Prompt recipe.

  2. Name the Output dataset product_reviews_extracted.

Steps to create a new prompt recipe with the output dataset product_reviews_extracted.

Write the prompt#

This prompt will instruct the LLM to extract information from the reviews, including the name of the product, sentiment, a summary, and other useful information. The output will be in JSON format, which we can then parse and analyze more easily than the raw text.

  1. Choose an LLM connection.

  2. In the Prompt box, copy and paste the following text:

Extract the key information from the below product reviews in JSON format (no Markdown). Your answer should follow the JSON schema below:
{
 "title": "Product review",
 "description": "A product review",
 "type": "object",
 "properties": {
   "name": {
     "description": "The name of the product covered in the review. 'Unknown' if it cannot be determined",
     "type": "string"
   },
   "summary": {
     "description": "A summary of the review",
     "type": "string"
   },
   "predicted_sentiment": {
     "description": "Whether the product review is positive, negative or neutral",
     "type": "string",
     "enum": ["positive", "neutral", "negative"]
   },
   "advantages": {
     "description": "Advantages of the products mentioned in the review. Return an empty list if the review doesn't mention an advantage",
     "type": "array",
     "items": {
       "type": "string"
     }
   },
   "drawbacks": {
     "description": "Drawbacks of the products mentioned in the review. Return an empty list if the review doesn't mention a drawback",
     "type": "array",
     "items": {
       "type": "string"
     }
   }
 },
 "required": [ "name", "summary", "predicted_sentiment", "advantages", "drawbacks"]
}
  3. Click Add inputs and add the following:

  • Description: Product review

  • Column: text

  4. Click Save.

  5. Run the recipe.

  6. After the recipe has finished running, click Explore dataset product_reviews_extracted.

Steps to write a prompt and give the LLM input.
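
A schema pasted into a prompt is easy to break with a stray comma or a misplaced brace, and a key listed under required that is not defined under properties can confuse the model. As an optional check outside Dataiku, a minimal sketch like the one below can catch both problems before you paste the schema. The check_schema helper is hypothetical, and the schema shown is a shortened version of the one above.

```python
import json

# Shortened version of the tutorial schema, used here only as sample input.
schema_text = """
{
  "type": "object",
  "properties": {
    "name": {"type": "string"},
    "predicted_sentiment": {
      "type": "string",
      "enum": ["positive", "neutral", "negative"]
    }
  },
  "required": ["name", "predicted_sentiment"]
}
"""


def check_schema(text: str) -> list:
    """Return a list of problems found in a JSON schema string (empty if none)."""
    try:
        schema = json.loads(text)
    except json.JSONDecodeError as exc:
        # Trailing commas and unbalanced braces end up here.
        return [f"not valid JSON: {exc.msg} (line {exc.lineno})"]
    if not isinstance(schema, dict):
        return ["schema is not a JSON object"]
    properties = schema.get("properties", {})
    return [
        f"required key '{key}' is not defined in properties"
        for key in schema.get("required", [])
        if key not in properties
    ]


problems = check_schema(schema_text)
print(problems or "schema looks consistent")
```

This only checks that the text is well-formed JSON and that the required list matches the declared properties. It does not validate the schema against the JSON Schema specification.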

The recipe created several columns. The llm_output column contains the JSON output we instructed the model to give. Notice there is also an llm_validation_status column with no values.

The first output dataset.

Note

Your results might differ from those shown because models can process the same input differently.
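
Once the responses are JSON strings, they can be parsed into structured records for analysis. The following is a minimal sketch in plain Python, assuming the llm_output values have been exported as a list of strings. The three sample strings are made up for illustration, and parse_output is a hypothetical helper, not part of Dataiku.

```python
import json
from collections import Counter

# Made-up llm_output values: two JSON responses and one free-text response.
llm_outputs = [
    '{"name": "Blender", "summary": "Quiet and powerful.", '
    '"predicted_sentiment": "positive", "advantages": ["quiet"], "drawbacks": []}',
    '{"name": "Unknown", "summary": "Broke quickly.", '
    '"predicted_sentiment": "negative", "advantages": [], "drawbacks": ["fragile"]}',
    "Sorry, I cannot help with that.",
]


def parse_output(raw: str):
    """Parse one response; return a dict, or None if it is not a JSON object."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


parsed_rows = [parse_output(raw) for raw in llm_outputs]

# Tally sentiments across the responses that parsed successfully.
sentiments = Counter(
    row.get("predicted_sentiment", "missing")
    for row in parsed_rows
    if row is not None
)
unparsed = sum(row is None for row in parsed_rows)

print(dict(sentiments), "unparsed:", unparsed)
```

Keeping unparsable responses as None rather than raising makes it possible to count them, which is a rough manual equivalent of the validation step that follows.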

Create validation settings#

Let’s update the recipe settings to validate that the model’s results are in the JSON format we specified in the prompt. Results will be output in the llm_validation_status column.

  1. Click Parent recipe to return to the recipe settings.

  2. Scroll to Result validation settings > Expected output format and choose JSON object.

  3. Click Add key and enter the following keys:

  • name

  • summary

  • predicted_sentiment

  • advantages

  • drawbacks

  4. Save and run the recipe, then navigate to the updated output dataset.

The output dataset after adding validation settings.

The output dataset now includes values in the llm_validation_status column. If you have any invalid responses, the dataset will also contain a value in the llm_error_message column showing the issue.

In this example, the required JSON key predicted_sentiment was not present in one response from the model.

The output dataset showing an invalid response.
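
The kind of check configured above (the response must be a JSON object and must contain the listed keys) can be sketched in a few lines of Python. This is only an illustration of the logic under stated assumptions, not the recipe's actual implementation: the validate_response helper, its return values, and the error message wording are all hypothetical.

```python
import json

EXPECTED_KEYS = ["name", "summary", "predicted_sentiment", "advantages", "drawbacks"]


def validate_response(raw: str, expected_keys=EXPECTED_KEYS):
    """Return (is_valid, error_message) for one model response."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"Not valid JSON: {exc.msg}"
    if not isinstance(parsed, dict):
        return False, "Response is not a JSON object"
    missing = [key for key in expected_keys if key not in parsed]
    if missing:
        return False, "Missing expected keys: " + ", ".join(missing)
    return True, ""


# Made-up response that omits predicted_sentiment.
response = (
    '{"name": "Blender", "summary": "Works well.", '
    '"advantages": ["quiet"], "drawbacks": []}'
)
is_valid, error = validate_response(response)
print(is_valid, error)
```

A check like this only confirms the structure of the response. It says nothing about whether the extracted values are accurate, so reviewing a sample of results by hand is still worthwhile.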

Tip

If you want to iterate on prompts or compare the performance of different LLMs without running them on the entire dataset, use the Prompt Studio to test results on dataset samples! You can access Prompt Studios directly from the Prompt recipe with the Edit in Prompt Studio button.