Tutorial | Build a multimodal knowledge bank for a RAG project#

Get started#

The Embed documents recipe in Dataiku allows you to augment a Large Language Model (LLM) with specialized internal knowledge from your organization to increase the relevance and accuracy of the model’s responses.

This recipe can process documents of different formats (.pdf, .docx, .pptx, .txt, .md, .html, .png, or .jpg) and thus generate a multimodal knowledge bank.

In this tutorial, you’ll use the Retrieval Augmented Generation (RAG) approach to enrich an LLM with content from various Dataiku documentation resources to help users find relevant answers to their questions.

Objectives#

In this tutorial, you will:

  • Use the Embed documents recipe to extract the data from a variety of file formats and vectorize it into a multimodal knowledge bank.

  • Create an augmented LLM based on this knowledge bank.

  • Use a Prompt Studio to test responses from this augmented LLM.

Prerequisites#

To use the Embed documents recipe, you'll need:

  • An LLM connection that supports text embedding.

  • A connection to a vision LLM (VLM), which this tutorial uses for visual extraction.

Tip

You don’t need previous experience with Large Language Models (LLMs), though it would be useful to read the article Concept | Embed recipes and Retrieval Augmented Generation (RAG) before completing this tutorial.

Create the project#

  1. From the Dataiku Design homepage, click + New Project.

  2. Select Learning projects.

  3. Search for and select RAG with the Embed Documents Recipe.

  4. If needed, change the folder into which the project will be installed, and click Install.

  5. From the project homepage, click Go to Flow (or type g + f).

Note

You can also download the starter project from this website and import it as a zip file.

Extract your knowledge#

Before creating an augmented LLM, you must first create a knowledge bank. This section will guide you through the creation and configuration of the Embed documents recipe.

Create the Embed documents recipe#

First, create the Embed documents recipe from the various types of documents stored in the managed folder.

  1. In the Flow, select the dataiku_doc managed folder.

  2. Go to the Actions tab of the right panel, and click Embed documents under the Visual Recipes section.

  3. In the New embed documents recipe window, name the output knowledge bank knowledge_bank.

  4. In the Embedding model field, select an LLM you can use for text embedding (that is, text vectorization: encoding the text's semantic information as a numerical representation; see the sketch after these steps).

  5. Keep the default ChromaDB vector store type.

  6. Click Create Recipe.

Screenshot of the creation page of an Embed recipe.
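
If you're curious about what an embedding actually is, you can compute one directly from a Python notebook through the LLM Mesh API. The following is a minimal sketch, not part of the recipe itself; the model ID is hypothetical, and you can list the real IDs available to you with project.list_llms().

    import dataiku

    client = dataiku.api_client()
    project = client.get_default_project()

    # Hypothetical embedding model ID -- list yours with project.list_llms().
    embedding_llm = project.get_llm("openai:my-connection:text-embedding-3-small")

    # Encode a sentence's semantic information as a vector of floats.
    query = embedding_llm.new_embeddings()
    query.add_text("The Embed documents recipe vectorizes documents for RAG.")
    response = query.execute()

    vector = response.get_embeddings()[0]
    print(len(vector))  # dimensionality of the embedding space

Each chunk of text in the knowledge bank is stored alongside such a vector, so semantically similar text can be found by comparing vectors.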

Configure the Embed documents recipe#

The Embed documents recipe can apply different extraction strategies depending on your use case.

  • Text-only: Extracts only text, ignoring elements like images and complex tables.

  • Visual: Converts pages into images, which a vision LLM (VLM) then summarizes.

  • Custom: Gives you full control over the extraction process through rules that determine which strategy to apply based on file extension, name, or path.

See also

Learn more about Embedding and searching documents in the reference documentation.

In this use case, the dataiku_doc folder includes documents like data-quality-flipbook.pdf, which has important data inside infographics. To extract this information into a knowledge bank, you’ll need to use visual extraction.

  1. In the recipe’s Settings tab, confirm Visual Extraction is selected.

  2. Under Visual Extraction, confirm that the vision LLM (VLM) you wish to use is selected.

  3. Click Run to execute the recipe.

    Screenshot of the Settings tab of the Embed documents recipe.
  4. Once the recipe has finished running, look at the Flow. You’ll see two outputs for the Embed documents recipe:

  • dataiku_doc_embedded_images: Includes the images taken from the documents, whose content was extracted using the VLM (see the sketch below). If you had chosen text-only extraction, this folder would be empty.

  • knowledge_bank: Stores the output of the text embedding.
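
If you want to verify what visual extraction produced, you can list the image folder's contents from a Python notebook. A quick sketch, assuming the code runs inside this project:

    import dataiku

    # Read the output managed folder by name (works from inside the project).
    folder = dataiku.Folder("dataiku_doc_embedded_images")

    # Each path is an image generated from a document page during extraction.
    for path in folder.list_paths_in_partition():
        print(path)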

Augment an LLM with your knowledge#

Now you have a knowledge bank object in the Flow.

You can use a knowledge bank in a variety of ways: to power a chat application, to serve as an agent tool, or to query it directly with Python code, as sketched below. You can also use it to create a retrieval-augmented LLM, which you'll do next.
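
For instance, here is a minimal sketch of a direct Python query against the knowledge bank through its LangChain bindings. It assumes that "knowledge_bank" works as the lookup key; in practice, you may need the knowledge bank's unique ID, which is visible in its URL.

    import dataiku

    # Assumes the name resolves the knowledge bank; you may need its unique ID.
    kb = dataiku.KnowledgeBank(id="knowledge_bank")
    vector_store = kb.as_langchain_vectorstore()

    # Fetch the five chunks most similar to the question.
    docs = vector_store.similarity_search(
        "How does Dataiku measure data quality?", k=5
    )
    for doc in docs:
        print(doc.metadata, doc.page_content[:120])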

Create an augmented LLM#

To enable an LLM to generate responses based on internal documents, you can augment it with a knowledge bank.

  1. From the Flow, double-click to open the knowledge_bank object.

  2. Within the Usage tab of the knowledge bank, click + Create a Retrieval-Augmented LLM.

  3. Select the LLM you want to augment.

  4. Click OK to create a new retrieval-augmented LLM.

Screenshot of the Usage tab of the knowledge bank.

Configure an augmented LLM#

You can further design an augmented LLM in a variety of ways depending on your needs.

In this example, you'll configure the LLM to build an answer in plain text after searching its knowledge bank for the five most relevant documents among the 20 documents closest to the query (a diversity-based search, sketched in code after these steps).

  1. Within the Design tab of the augmented LLM, for the Search type field, select Improve diversity of documents.

  2. Set the Number of documents to retrieve option to 5.

  3. Click Save.

  4. Click Test in Prompt Studio.

  5. Name the new Prompt Studio dataiku_documentation.

  6. Click Create.

Dataiku screenshot of the dialog to create a Prompt Studio from the Design tab of a retrieval-augmented LLM.
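
For context, the Improve diversity of documents search type corresponds to a maximal marginal relevance (MMR) style of search: it fetches a larger pool of candidate chunks, then keeps those that are relevant to the query yet dissimilar to each other. A hedged sketch of an equivalent query through the knowledge bank's LangChain vector store (the ID is an assumption, as before):

    import dataiku

    kb = dataiku.KnowledgeBank(id="knowledge_bank")  # ID is an assumption
    vector_store = kb.as_langchain_vectorstore()

    # Fetch the 20 nearest chunks, then keep the 5 that best balance
    # relevance to the query with diversity among each other.
    docs = vector_store.max_marginal_relevance_search(
        "How do I monitor data quality in Dataiku?", k=5, fetch_k=20
    )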

See also

You can find more details about the design settings of an augmented LLM in the Advanced Search section of the reference documentation.

Test an augmented LLM#

Once you have an augmented LLM, it functions much like any other LLM. Accordingly, you can use it in a Prompt Studio.
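
It is also callable from code. Here's a minimal sketch using the LLM Mesh completion API; the augmented LLM's ID below is hypothetical, so list the real IDs first with project.list_llms().

    import dataiku

    client = dataiku.api_client()
    project = client.get_default_project()

    # Find the retrieval-augmented LLM's ID among the available models.
    for llm in project.list_llms():
        print(f"- {llm.description} (id: {llm.id})")

    # Hypothetical ID -- substitute the one printed above.
    augmented = project.get_llm("retrieval-augmented:my-augmented-llm")
    completion = augmented.new_completion()
    completion.with_message(
        "What's the difference between the Group and Window recipes?"
    )
    response = completion.execute()
    if response.success:
        print(response.text)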

Prompt an augmented LLM without inputs#

The Prompt Studio already includes an empty prompt. First, use a specific fact from a document in the folder to confirm that the augmented LLM has absorbed the folder’s knowledge.

  1. In the Agent/LLM field of the Prompt Studio, confirm the Retrieval of knowledge bank, using <YOUR LLM> is selected.

  2. In the Prompt field, copy and paste the following prompt.

    What percentage of senior analytics and IT leaders cite data quality and usability as their main challenge with data infrastructure?
    Return only a two-digit percentage.
    
  3. Click Run Prompt.

  4. Confirm that, in addition to a response, the result includes sources.

Dataiku screenshot of a Prompt Studio without inputs.

Tip

By nature, generative AI doesn’t return reproducible results. However, you hopefully received an answer of 45%, citing page 2 of the data-quality-flipbook.pdf as the source. Try asking this same question to your original non-augmented LLM, as well as an LLM augmented with a knowledge bank created using only text extraction. You’ll almost certainly receive different responses!

Design a prompt for an augmented LLM#

Having confirmed the augmented LLM works as expected, you can continue testing it.

  1. In the left panel of the studio, click + Add Prompt > Managed mode (Blank template) > Create.

  2. In the Prompt field, copy and paste the following prompt.

    You're an expert in Dataiku and rely on the knowledge from the Dataiku documentation.
    When answering questions, be sure to provide answers that reflect the content of the documentation, but avoid saying things like 'according to the documentation'.
    Instead, subtly mention that the information is based on the Dataiku documentation.
    
  3. On the right, in the Inputs from dropdown menu, select Written test cases.

  4. Rename the Description field as Question to change the column header under Test cases.

  5. Under Test cases, click + Add Test Case to gauge how the augmented LLM runs the prompt as it is.

  6. Copy and paste the following text into the Question box:

    What's the difference between the Group and Window recipes?
    
  7. Click Run Prompt to pass the prompt and test case to the augmented LLM.

  8. Examine the Result section, which includes not only an answer but also five sources.

Dataiku screenshot of a test case in a Prompt Studio.

Quick recap of prompting an augmented LLM#

Concretely, here’s what happens when you run the test (a code sketch of these steps follows the list):

  1. The knowledge bank identifies up to five chunks of text that are similar to the prompt. You see them in the Sources section at the bottom of the response.

    Note

    Why five? This is because you set the Number of documents to retrieve option in the Design tab of the augmented LLM to 5. However, if multiple chunks come from the same document page, there may be fewer than five unique sources.

  2. For each identified chunk, the system retrieves the stored content:

    • If using a Vision Language Model (VLM) engine, this is the image of the corresponding document page.

    • If using a Text engine, this is the extracted text from before chunking (also called unchunked text).

  3. This retrieved content is added to the prompt.

  4. The LLM generates a response based on this augmented prompt.

  5. Dataiku adds the metadata, if any, in the Sources section at the bottom of the response.
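
To make these steps concrete, here is a sketch that reproduces steps 1 through 4 by hand: retrieve chunks, splice them into the prompt, and call a base LLM. The knowledge bank and LLM IDs are assumptions, and the prompt template is illustrative rather than Dataiku's internal one.

    import dataiku

    question = "What's the difference between the Group and Window recipes?"

    # Steps 1-2: retrieve the most similar chunks and their stored content.
    kb = dataiku.KnowledgeBank(id="knowledge_bank")  # ID is an assumption
    docs = kb.as_langchain_vectorstore().max_marginal_relevance_search(
        question, k=5, fetch_k=20
    )

    # Step 3: add the retrieved content to the prompt.
    context = "\n\n".join(doc.page_content for doc in docs)
    augmented_prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # Step 4: generate a response from the augmented prompt with a base LLM.
    client = dataiku.api_client()
    llm = client.get_default_project().get_llm("openai:my-connection:gpt-4o")  # hypothetical ID
    response = llm.new_completion().with_message(augmented_prompt).execute()
    print(response.text)

    # Step 5: the metadata shown under Sources comes from the chunk metadata.
    for doc in docs:
        print(doc.metadata)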

Next steps#

Now that you know how to augment an LLM with your specific knowledge, you could follow Tutorial | Build a conversational interface with Dataiku Answers.
