Tutorial | Use the Retrieval Augmented Generation (RAG) approach for question-answering#
The Embed recipe in Dataiku allows you to augment large language models (LLMs) with specialized internal knowledge from your organization to increase the relevance and accuracy of the models’ responses.
In this tutorial, we’ll use the Retrieval Augmented Generation (RAG) approach to augment the GPT-3.5 chat model from OpenAI with content from the Dataiku Knowledge Base to help Dataiku users find relevant answers to their questions.
Get Started#
Objectives#
In this tutorial, you will:
Use the Embed recipe to vectorize the textual data from the Dataiku Knowledge Base.
Create a prompt in the Prompt Studio to check the response from the augmented LLM.
Prerequisites#
To use the Embed text recipe, you’ll need:
Dataiku 12.3 or later.
An Advanced Analytics Designer or Full Designer user profile.
A compatible code environment for retrieval-augmented models. This environment must be created beforehand by an administrator and include the Retrieval Augmented Generation models package.
A connection to a supported embedding model, which will be used for text embedding in the Embed recipe.
A connection to a supported generative AI model, which is the model that will be augmented. See LLM connections for details.
Create the project#
To create the project:
From the Dataiku Design homepage, click +New Project > DSS tutorials > ML Practitioner > LLM - Question Answering with RAG Approach.
From the project homepage, click Go to Flow (or type g + f).
Note
You can also download the starter project from this website and import it as a zip file.
Create the Embed recipe#
The first step is to attach the Embed recipe to the input dataset. In this tutorial, the dataset we’ll use includes the URL and text of each Dataiku Knowledge Base article.
Let’s use the Embed text recipe to vectorize the content of the cleaned_text column.
To do so:
From the Flow, select the dataiku_knowledge_base dataset and in the Actions tab, click Embed under the LLM Recipes section.
In the New embedding recipe window:
Keep the selected input dataset.
Name the output knowledge bank `knowledge-bank`.
In the Embedding model field, select a model you can use for text embedding (i.e. text vectorization, which means encoding the semantic information into a numerical representation; see the sketch after these steps).
Click Create Recipe. It opens the recipe settings page.
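To build intuition for what text embedding does to each row, here is a minimal sketch using the open-source sentence-transformers library. It is not the code the Embed recipe runs internally; the model name and sample texts are assumptions for illustration:

```python
# Conceptual sketch of text embedding: each document becomes a vector
# of numbers, so that semantically similar texts end up close together.
# This is NOT Dataiku's internal implementation.
from sentence_transformers import SentenceTransformer

# Hypothetical articles standing in for the cleaned_text column.
articles = [
    "ML assertions let you check predictions on known subpopulations.",
    "Overrides force a model's output when business rules must win.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works
vectors = model.encode(articles)  # one numerical vector per document

print(vectors.shape)  # e.g. (2, 384): 2 documents, 384 dimensions each
```

Documents whose vectors land close together in this space are the ones the knowledge bank will later retrieve as similar to a query.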
Configure the Embed recipe and knowledge bank#
Now, let’s see how to configure the recipe for text embedding and the knowledge bank, which is the object that stores the output of the text embedding.
In the Knowledge column option, select the cleaned_text column, which is the column that includes the textual data to vectorize.
Under Metadata columns, click the + Add button and select the url column as the column that stores the metadata Dataiku will use to identify this source when it is used in LLM responses.
Click Save.
Still from the Embed settings page, in the Knowledge bank settings section, click Edit. This opens the knowledge bank settings page.
In the Use tab, to configure the LLM that we’ll augment with the content of the knowledge bank, click the + Add Augmented LLM button and fill in the fields as below:
Keep the generated Augmented LLM ID, or enter any other ID you wish.
In the LLM field, select the LLM that you want to augment (here, GPT 3.5 Chat).
Set the Documents to retrieve option to `5`.
Enable the Improve diversity of documents option and keep the default values for the diversity options.
Keep the Print sources option enabled so that Dataiku appends to each LLM response the details of the sources used to generate it.
Keep the source output format as is.
In the Core settings tab:
Keep the embedding model. This is the one you selected upon creating the Embed recipe.
Keep the default FAISS vector store type.
Set the Code env option to Select an environment.
In the Environment option, select a code environment that includes the Retrieval Augmented Generation models package.
Click Save then Parent Recipe to go back to the Embed recipe settings page.
Click Run to start executing the Embed recipe.
With this configuration, we are augmenting the GPT 3.5 Chat LLM with the content from the cleaned_text column of the dataiku_knowledge_base dataset. Because we enabled the diversity option, Dataiku fetches the 20 documents closest to the query and, from among them, passes the top five documents to the LLM to build an answer in plain text.
As we enabled the Print sources option, when we test the augmented LLM in the Prompt Studio, Dataiku will indicate the five sources used to generate the answer. The sketch below illustrates this fetch-then-select retrieval step.
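Here is a simplified sketch of the fetch-then-select idea using FAISS, the vector store type we kept above. The dimensions, stand-in data, and the crude diversity heuristic are assumptions for illustration; Dataiku’s actual retrieval logic may differ:

```python
# Simplified sketch of "fetch 20, keep 5 diverse" retrieval with FAISS.
# Illustration only, not Dataiku's internal code.
import numpy as np
import faiss

dim = 384
rng = np.random.default_rng(0)
doc_vectors = rng.random((1000, dim), dtype=np.float32)  # stand-in embeddings
index = faiss.IndexFlatL2(dim)  # the FAISS vector store kept above
index.add(doc_vectors)

query = rng.random((1, dim), dtype=np.float32)
_, candidates = index.search(query, 20)  # the 20 documents closest to the query

# Greedy diversity pass: keep a candidate only if it is not too close
# to an already-selected document (a crude stand-in for the real heuristic).
selected = []
for i in candidates[0]:
    if all(np.linalg.norm(doc_vectors[i] - doc_vectors[j]) > 0.5 for j in selected):
        selected.append(int(i))
    if len(selected) == 5:  # the "Documents to retrieve" setting
        break
print(selected)
```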
Test the augmented LLM in a Prompt Studio#
Now, let’s see how the augmented LLM responds to a prompt.
Create the Prompt Studio#
The first thing to do is create a Prompt Studio.
In the top navigation bar, select Visual Analyses > Prompt Studios.
Click New Prompt Studio in the top right, give the new studio the name `dataiku-knowledge-base`, then click Create.
In the Add a new prompt window, select Managed mode.
From the Templates that appear below, leave the default Blank template.
Click Create. It opens the Prompt design page.
Design a prompt#
On the Prompt design page, we’ll add our prompt text and run a test using the augmented LLM.
In the studio, in the LLM option, select the augmented LLM in the Retrieval augmented section at the bottom of the dropdown.
Note
The name for the augmented LLMs is `Retrieval of <knowledge_bank_id>, using <augmented_model_name>`. If you augment the same model more than once using the same knowledge bank, the LLM ID you set is added: `Retrieval of <knowledge_bank_id> (id: <llm_id>) using <augmented_model_name>`.

In the main input field, copy and paste the following prompt, replacing the text Explain here what the model must do. Use the Copy button at the top right of the block for easier copying.
You're an expert in Dataiku and rely on the knowledge from the Dataiku knowledge base. When answering questions, be sure to provide answers that reflect the content of the knowledge base, but avoid saying things like 'according to the knowledge base'. Instead, subtly mention that the information is based on the Dataiku knowledge base.
On the right, create an input by writing `Question` in the Description box.
In the Inputs from dropdown menu, select Written test cases.
The Question input now appears as a column header under Test cases. We’ll add one test case to gauge how our model runs the prompt as it is.
Click Add Test Case and copy and paste the following text into the Question box:
How do ML assertions and overrides differ?
Click Run Prompt to pass the prompt and test case to your selected model.
Depending on the model you selected, you might get different results.
Concretely, here’s what happened upon running the test (a simplified sketch follows the list):
Based on the initial prompt you enter, the knowledge bank identifies five chunks of text that are similar to the prompt.
Note
Why five? Because we set the Documents to retrieve option to 5 in the Use tab of the knowledge bank.
These five text chunks are fetched from the knowledge bank and added to the prompt.
The LLM generates a response based on this augmented prompt.
Dataiku adds the metadata (here, the original article URLs and raw content) in the Sources section at the bottom of the response.
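Here is a hedged sketch of steps 2 and 3, the prompt augmentation. The chunk texts and URL are invented for illustration; the real prompt assembly is handled by Dataiku internally:

```python
# Simplified picture of the augmentation step: splice the retrieved
# chunks into the prompt before calling the LLM. Data is made up here.
retrieved = [
    {"text": "ML assertions check that a model behaves as expected...",
     "url": "https://knowledge.dataiku.com/latest/ml-assertions.html"},
    # ...four more chunks in the real flow (5 documents retrieved)
]

system_prompt = "You're an expert in Dataiku..."
question = "How do ML assertions and overrides differ?"

context = "\n\n".join(chunk["text"] for chunk in retrieved)
augmented_prompt = (
    f"{system_prompt}\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}"
)

# The LLM answers from augmented_prompt; with Print sources enabled,
# Dataiku then appends each chunk's metadata (here, its url) as Sources.
sources = [chunk["url"] for chunk in retrieved]
```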
What’s next?#
Now that you know how to augment an LLM with your specific knowledge, you could create a dataset of questions to use as test cases in a Prompt Studio, then create a Prompt recipe from it.
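If you’d rather script such tests, the sketch below queries the augmented LLM through Dataiku’s Python LLM Mesh API from inside DSS. The LLM ID shown is a placeholder; look up the real one with `project.list_llms()` or copy it from the Retrieval augmented section of the LLM dropdown:

```python
# Sketch of querying the augmented LLM from a Python notebook or recipe
# running inside DSS. The LLM ID below is a placeholder, not a real ID.
import dataiku

client = dataiku.api_client()
project = client.get_default_project()

llm = project.get_llm("retrieval-augmented:knowledge-bank:gpt-3-5")  # placeholder
completion = llm.new_completion()
completion.with_message("How do ML assertions and overrides differ?")
response = completion.execute()

if response.success:
    print(response.text)  # the answer, plus the printed sources
```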
For more information:
On the Embed recipe and the RAG approach, see the Concept | Embed recipe and Retrieval Augmented Generation (RAG) article.
On prompt engineering, see the Tutorial | Prompt engineering with LLMs article.