Tutorial | Build a multimodal knowledge bank for a RAG project#

Get started#

The Embed documents recipe in Dataiku allows you to augment Large Language Models (LLMs) with specialized internal knowledge from your organization to increase the relevance and accuracy of the model responses.

It can process documents in a range of formats (.pdf, .docx, .pptx, .txt, and .md) and thus generates a multimodal knowledge bank.

In this tutorial, we’ll use the Retrieval Augmented Generation (RAG) approach to enrich an LLM with content from various Dataiku documentation resources to help Dataiku users find relevant answers to their questions.

Objectives#

In this tutorial, you will:

  • Use the Embed documents recipe to extract the data from selected pages of the Dataiku documentation and vectorize it into a multimodal knowledge bank.

  • Create a prompt in the Prompt Studio to evaluate the response from the augmented LLM.

Prerequisites#

To use the Embed documents recipe, you’ll need:

  • Dataiku 13.4 or later.

  • An Advanced Analytics Designer or Full Designer user profile.

  • A compatible code environment for retrieval augmented models.

  • A connection to a supported embedding model for text embedding in the Embed recipe.

  • A connection to a supported Generative AI model, which is the model that will be augmented. See LLM connections for details.

Tip

You don’t need previous experience with Large Language Models (LLMs), though it would be useful to read the article Concept | Embed recipes and Retrieval Augmented Generation (RAG) before completing this tutorial.

Create the project#

  1. From the Dataiku Design homepage, click + New Project.

  2. Select Learning projects.

  3. Search for and select Build a multimodal knowledge bank for a RAG project.

  4. Click Install.

  5. From the project homepage, click Go to Flow (or g + f).

Note

You can also download the starter project from this website and import it as a zip file.

Add the Embed documents recipe#

This section will guide you through the creation and configuration of the Embed documents recipe.

Create the Embed documents recipe#

Let’s first create the Embed documents recipe from the data that’s stored in the managed folder.

To do so:

  1. In the Flow, select the dataiku_doc managed folder.

  2. Go to the Actions tab and click Embed documents under the Visual Recipes section.

  3. In the New embed documents recipe window:

    • Keep the selected input.

    • In the Vision language model field, select a vision LLM connection that will be used for the VLM extraction. This ensures that the model can process images from documents (such as .pdf, .docx, and .pptx) to generate a RAG-friendly summary. This summary helps keep the extracted content concise and more likely to fit within the embedding model’s size limits.

    • Name the output knowledge bank multimodal_knowledge_bank.

    • In the Embedding model field, select a model you can use for text embedding (i.e. text vectorization, which means encoding the semantic information into a numerical representation).

    • Keep the default ChromaDB vector store type.

  4. Click Create Recipe.

Screenshot of the creation page of an Embed recipe.
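
To make the text embedding step above more concrete, here is a minimal sketch using the LLM Mesh Python API from a notebook inside the project. The model ID below is a placeholder for whichever embedding connection you selected, and the exact response fields may vary with your Dataiku version.

    import dataiku

    # Connect to the current Dataiku project through the public API client.
    client = dataiku.api_client()
    project = client.get_default_project()

    # Placeholder ID: replace with the embedding model connection selected in the recipe.
    embedding_model = project.get_llm("your-embedding-model-id")

    # Encode a question into a numerical vector, as the recipe does for each text chunk.
    query = embedding_model.new_embeddings()
    query.add_text("What's the difference between the Group and Window recipes?")
    response = query.execute()

    vector = response.get_embeddings()[0]
    print(len(vector))  # dimensionality of the embedding vector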

Configure the Embed documents recipe#

Now, it’s time to configure the recipe.

The settings page of the Embed documents recipe allows you to select the extraction strategy to use depending on the file types.

Screenshot of the settings page of the Embed documents recipe.

As shown above, by default, the recipe automatically applies the following extraction strategy to your documents:

  • VLM extraction for .pdf or .pptx files.

  • Structured text extraction for .txt or .md files.
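
Conceptually, these defaults behave like a mapping from file extension to extraction engine. The snippet below is purely illustrative (it is not a Dataiku API); it is only meant to picture the rule logic:

    # Illustrative only: how the default extraction rules map extensions to engines.
    DEFAULT_EXTRACTION_RULES = {
        ".pdf": "VLM extraction",
        ".pptx": "VLM extraction",
        ".txt": "Structured text extraction",
        ".md": "Structured text extraction",
    }

    def default_engine(filename: str) -> str:
        """Return the extraction engine the default rules would apply to a file."""
        for extension, engine in DEFAULT_EXTRACTION_RULES.items():
            if filename.lower().endswith(extension):
                return engine
        return "no matching rule"  # e.g. a .docx file, handled in the next steps

    print(default_engine("getting_started.pdf"))   # VLM extraction
    print(default_engine("release_notes.docx"))    # no matching rule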

Usually, no action is required, and you can move to the next stage. However, since our source folder also contains a Word document (.docx), let’s add an extraction rule and adjust some advanced settings.

  1. In the first block, click + Add a Condition.

  2. In the new condition that appears, enter .docx to add this extension.

  3. Click Show advanced settings to check the extraction rules.

  4. In the Select a model to use field, ensure that the LLM connection used to summarize your documents is the one you selected when creating the recipe.

  5. Leave all other options unchanged and click Run to create the knowledge bank, which is the object that stores the output of the text embedding.

Screenshot of the settings page of the Embed documents recipe.

Once the run is over, if you look at the Flow, you’ll see two outputs of the Embed documents recipe:

  • dataiku_doc_embedded_images: Includes all images taken from the documents whose content was extracted using VLM extraction.

  • multimodal_knowledge_bank: Stores the output of the text embedding.
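
If you want to inspect the first output programmatically, here is a minimal sketch from a Python notebook or recipe in the project, assuming the folder keeps the name shown above:

    import dataiku

    # Open the managed folder produced by the Embed documents recipe.
    images_folder = dataiku.Folder("dataiku_doc_embedded_images")

    # List the page images extracted by the VLM extraction engine.
    for path in images_folder.list_paths_in_partition():
        print(path)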

Configure the knowledge bank#

Now that the Embed documents recipe has created the knowledge bank object in the Flow, let’s configure it to augment the LLM.

  1. In the Usage tab, to configure the LLM that we’ll augment with the content of the Dataiku documentation, click the + Add Augmented LLM button and fill in the fields as defined below:

    • Leave the generated Augmented LLM ID field as is or enter any other ID you wish using the pen icon.

    • In the LLM field, select the LLM that you want to augment (here, GPT-4o).

    • Set the Number of documents to retrieve option to 5.

    • Enable the Improve diversity of documents option and keep the default values for the diversity options.

    • Keep the Print document sources option set to Metadata & retrieval content so that Dataiku adds details about the sources used to generate each response.

    • Keep the other options unchanged.

    Screenshot of the Usage tab of the knowledge bank.
  2. Click Save.

With this configuration, we augment the LLM with the content of the articles initially stored in the dataiku_doc folder and ask the LLM to use the top five documents among the 20 documents closest to the query to build an answer in plain text.

As we enabled the Print document sources option, when testing the augmented LLM in the Prompt Studio (see section below), Dataiku will display the five sources that the model used to generate the answer.
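
Once saved, the augmented LLM also becomes available through the LLM Mesh Python API, alongside the plain LLM connections. Here is a minimal sketch to locate it; the exact item fields may differ by version, and the ID below is a placeholder.

    import dataiku

    client = dataiku.api_client()
    project = client.get_default_project()

    # The augmented LLM shows up with a name such as
    # "Retrieval of multimodal_knowledge_bank, using GPT-4o".
    for llm in project.list_llms():
        print(f"- {llm.description} (id: {llm.id})")

    # Placeholder: paste the id of the retrieval-augmented entry printed above.
    augmented_llm = project.get_llm("your-retrieval-augmented-llm-id")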

Test the augmented LLM in a Prompt Studio#

Now, let’s see how the augmented LLM responds to a prompt.

Create the Prompt Studio#

The first thing to do is create a Prompt Studio.

  1. Go back to the Usage tab of the knowledge bank and click Test in Prompt Studio next to the augmented LLM.

    Tip

    You can also access the Prompt Studios by selecting Visual ML > Prompt Studios in the top navigation bar.

  2. Give the new studio the name dataiku_documentation.

  3. Click Create. It opens the Prompt design page.

Design a prompt#

On the Prompt design page, we’ll add our prompt text and run a test using the augmented LLM.

  1. In the left panel of the studio, click + Add Prompt.

  2. Select Managed mode > Blank template > Create.

  3. In the LLM option, ensure that the augmented LLM in the Retrieval augmented section is selected.

    Note

    The name for an augmented LLM is Retrieval of <knowledge_bank_id>, using <augmented_model_name>.

    If you augment the same model more than once using the same knowledge bank, the LLM ID you set is added: Retrieval of <knowledge_bank_id> (id: <llm_id>) using <augmented_model_name>.

    Screenshot of the LLM selection in a Prompt Studio.
  4. In the Prompt field, copy and paste the following prompt. Use the Copy button at the right of the block for easier copying.

    You're an expert in Dataiku and rely on the knowledge from the Dataiku documentation. When answering questions, be sure to provide answers that reflect the content of the documentation, but avoid saying things like 'according to the documentation'. Instead, subtly mention that the information is based on the Dataiku documentation.
    
  5. On the right, in the Inputs from dropdown menu, select Written test cases.

    Note

    Feel free to rename the Description field. Instead of input, you could enter Question, for instance. This changes the column header under Test cases.

  6. Under Test cases, click + Add Test Case to add a test to gauge how our model runs the prompt as it is.

  7. Copy and paste the following text into the input box:

    What's the difference between the Group and Window recipes?
    
  8. Click Run Prompt to pass the prompt and test case to your selected model.

Screenshot of a test case in a Prompt Studio.

The results may differ from those above, as LLMs don’t generate the same response every time.

Concretely, here’s what happens when you run the test:

  1. Based on the initial prompt you enter, the knowledge bank identifies up to five chunks of text that are similar to the prompt. You see them in the Sources section at the bottom of the response.

    Note

    Why five? This is because we set the Number of documents to retrieve option in the Usage tab of the knowledge bank to 5. However, if multiple chunks come from the same document page, there may be fewer than five unique sources.

  2. For each identified chunk, the system retrieves the content stored for retrieval:

    • If using a Vision Language Model (VLM) engine, this is the image of the corresponding document page.

    • If using a Text engine, this is the extracted text before chunking (also called unchunked text).

  3. This retrieved content is added to the prompt.

  4. The LLM generates a response based on this augmented prompt.

  5. Dataiku adds the metadata, if any, in the Sources section at the bottom of the response.
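
You can reproduce the same test outside the Prompt Studio with the LLM Mesh Python API, sending the prompt as a system message and the test case as a user message. This is a minimal sketch: the retrieval-augmented LLM ID is a placeholder, and because the way sources appear in the raw response may vary by version, we only print the response text here.

    import dataiku

    client = dataiku.api_client()
    project = client.get_default_project()

    # Placeholder ID: use the retrieval-augmented LLM id from your project.
    augmented_llm = project.get_llm("your-retrieval-augmented-llm-id")

    completion = augmented_llm.new_completion()
    # The Prompt Studio prompt, passed as a system message.
    completion.with_message(
        "You're an expert in Dataiku and rely on the knowledge from the Dataiku "
        "documentation. When answering questions, be sure to provide answers that "
        "reflect the content of the documentation, but avoid saying things like "
        "'according to the documentation'. Instead, subtly mention that the "
        "information is based on the Dataiku documentation.",
        role="system",
    )
    # The test case question, passed as a user message.
    completion.with_message("What's the difference between the Group and Window recipes?")

    result = completion.execute()
    if result.success:
        print(result.text)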

Next steps#

Now that you know how to augment an LLM with your specific knowledge, you could follow Tutorial | Build a conversational interface with Dataiku Answers.

See also

For more information: