Tutorial | Use the Retrieval Augmented Generation (RAG) approach for question-answering#
The Embed recipe in Dataiku allows you to augment large language models (LLMs) with specialized internal knowledge from your organization to increase the relevance and accuracy of the models’ responses.
In this tutorial, we’ll use the Retrieval Augmented Generation (RAG) approach to augment the GPT 3.5 Chat model from OpenAI with the content from the Dataiku Knowledge Base to help Dataiku users find relevant answers to their questions.
In this tutorial, you will:
Use the Embed recipe to vectorize the textual data from the Dataiku Knowledge Base.
Create a prompt in the Prompt Studio to check the response from the augmented LLM.
To use the Embed text recipe, you’ll need:
A Dataiku instance (version 12.3 or above). Dataiku Cloud is compatible.
A compatible code environment for retrieval-augmented models. This environment must be created beforehand by an administrator and include the Retrieval Augmented Generation models package.
A connection to a supported embedding model, which will be used for text embedding in the Embed recipe. Note that in version 12.3, OpenAI is the only supported provider of embedding models.
A connection to a supported generative AI model, which is the model that will be augmented. See LLM connections for details.
Create the project#
To create the project:
From the Dataiku Design homepage, click +New Project > DSS tutorials > ML Practitioner > LLM - Question Answering with RAG Approach.
From the project homepage, click Go to Flow (or type g + f).
You can also download the starter project from this website and import it as a zip file.
Create the Embed recipe#
The first step is to attach the Embed recipe to the input dataset. In this tutorial, the dataset we’ll use includes the URL and text of each Dataiku Knowledge Base article.
Let’s use the Embed text recipe to vectorize the content of the cleaned_text column.
To do so:
From the Flow, select the dataiku_knowledge_base dataset and in the Actions tab, click Embed under the LLM Recipes section.
In the New embedding recipe window:
Keep the selected input dataset.
Name the output knowledge bank.
In the Embedding model field, select a model you can use for text embedding (i.e. text vectorization, which means encoding the semantic information into a numerical representation).
Click Create Recipe. This opens the recipe settings page.
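Under the hood, text embedding maps each piece of text to a vector so that semantically similar texts end up close together, and retrieval then amounts to finding the nearest vectors. The following minimal sketch illustrates this idea with made-up 4-dimensional toy vectors (real embedding models such as OpenAI's produce vectors with hundreds or thousands of dimensions); the texts and numbers are illustrative only:

```python
import math

def cosine_similarity(a, b):
    # Near 1.0 = same direction (similar meaning), near 0.0 = unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings standing in for real embedding-model output.
embeddings = {
    "How do I build a Flow?":        [0.9, 0.1, 0.0, 0.1],
    "Steps to create a Flow in DSS": [0.8, 0.2, 0.1, 0.1],
    "Pricing for cloud instances":   [0.0, 0.1, 0.9, 0.2],
}

query = embeddings["How do I build a Flow?"]
for text, vector in embeddings.items():
    print(f"{cosine_similarity(query, vector):.2f}  {text}")
```

The Flow-related texts score much closer to each other than to the unrelated pricing text, which is exactly the property the knowledge bank relies on when it fetches documents for a query.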
Configure the Embed recipe and knowledge bank#
Now, let’s see how to configure the recipe for text embedding and the knowledge bank, which is the object that stores the output of the text embedding.
In the Knowledge column option, select the cleaned_text column, which is the column that includes the textual data to vectorize.
Under Metadata columns, click the + Add button and select the url column as the column that stores the metadata Dataiku will use to identify this source when it is used in LLM responses.
Still from the Embed settings page, in the Knowledge bank settings section, click Edit. This opens the knowledge bank settings page.
In the Use tab, to configure the LLM that we’ll augment with the content of the knowledge bank, click the + Add button and fill in the fields as below:
In the LLM field, select the LLM that you want to augment (here, GPT 3.5 Chat).
Set the Documents to retrieve option to 5.
Enable the Improve diversity of documents option and keep the default value for the diversity options.
Keep the Print sources option enabled so that Dataiku appends to each LLM response details on the sources used to generate it.
Keep the source output format as is.
In the Core settings tab:
Keep the embedding model. This is the one you selected upon creating the Embed recipe.
Keep the default FAISS vector store type.
Set the Code env option to Select an environment.
In the Environment option, select a code environment that includes the Retrieval Augmented Generation models package.
Click Save then Parent Recipe to go back to the Embed recipe settings page.
Click Run to start executing the Embed recipe.
With this configuration, we are augmenting the GPT 3.5 Chat LLM with the content from the cleaned_text column of the dataiku_knowledge_base dataset. Among the 20 documents closest to the query, the LLM uses the top five to build a plain-text answer.
As we enabled the Print sources option, when we test the augmented LLM in the Prompt Studio, Dataiku will indicate the five sources used to generate the answer.
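Dataiku does not spell out the exact algorithm behind the Improve diversity of documents option here, but the behavior described above (consider the 20 closest candidates, keep five non-redundant ones) is conceptually similar to maximal marginal relevance (MMR). Here is a toy sketch of MMR over made-up vectors; the `lambda_` trade-off parameter and the data are assumptions for illustration:

```python
import math

def cos(a, b):
    # Plain cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def mmr_select(query, docs, k=5, lambda_=0.5):
    """Maximal marginal relevance: greedily pick k documents, trading
    relevance to the query against redundancy with documents already picked."""
    selected = []
    remaining = dict(docs)  # doc_id -> embedding vector
    while remaining and len(selected) < k:
        def score(doc_id):
            relevance = cos(query, remaining[doc_id])
            redundancy = max((cos(remaining[doc_id], docs[s]) for s in selected),
                             default=0.0)
            return lambda_ * relevance - (1 - lambda_) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected

# Toy 3-dimensional embeddings; d2 is a near-duplicate of d1.
docs = {
    "d1": [0.90, 0.10, 0.00],
    "d2": [0.88, 0.12, 0.00],
    "d3": [0.10, 0.90, 0.00],
    "d4": [0.00, 0.10, 0.90],
}
query = [1.0, 0.0, 0.0]
picked = mmr_select(query, docs, k=3, lambda_=0.3)
# d1 is picked first (most relevant); the near-duplicate d2 is skipped
# in favor of more diverse documents.
```

With diversity weighted in, the near-duplicate document loses to documents that add new information, which is why enabling the option tends to give the LLM broader context to answer from.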
Test the augmented LLM in a Prompt Studio#
Now, let’s see how the augmented LLM responds to a prompt.
Create the Prompt Studio#
The first thing to do is create a Prompt Studio.
In the top navigation bar, select Visual Analyses > Prompt Studios.
Click New Prompt Studio in the top right and give the new studio the name
dataiku-knowledge-base, then click OK.
In the studio, click Add prompt, then in the New Prompt window, select Prompt template, which can be used later to create a Prompt recipe.
In the LLM option, select the augmented LLM in the Retrieval augmented section.
Under Source for test cases, select Write queries directly.
Create the prompt.
Design a prompt#
On the Prompt design page, we’ll add our prompt text and run a test using the augmented LLM.
In the Task window, copy and paste the following prompt, replacing the existing explainer text (use the Copy button at the top right of the block for easier copying):
You're an expert in Dataiku and rely on the knowledge from the Dataiku knowledge base. When answering questions, be sure to provide answers that reflect the content of the knowledge base, but avoid saying things like 'according to the knowledge base'. Instead, subtly mention that the information is based on the Dataiku knowledge base.
On the right, create one input by writing Question in the Description box. As you add this input, it appears as the header under Test cases. We’ll add one test case to gauge how our model runs the prompt as it is.
Click Add a test case and copy and paste the following text into the Question box:
How do ML assertions and overrides differ?
Click Run to pass the prompt and test case to your selected model.
Depending on the model you selected, you might get different results.
Concretely, here’s what happened upon running the test:
Based on the initial prompt you enter, the knowledge bank identifies five chunks of text that are similar to the prompt.
Why five? This is because we asked to retrieve only five documents in the Documents to retrieve option of the Use tab of the knowledge bank.
These five text chunks are fetched from the knowledge bank and added to the prompt.
The LLM generates a response based on this augmented prompt.
Dataiku adds the metadata (here, the original article URLs) in the Sources section at the bottom of the response.
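The retrieve-then-prompt steps above can be sketched as follows. This is a conceptual illustration, not Dataiku’s internal implementation; the chunk texts and URLs are hypothetical placeholders for what the knowledge bank would return:

```python
def build_augmented_prompt(task, question, chunks):
    # Assemble the final prompt: task instructions, retrieved context
    # chunks (each tagged with its source URL), then the user's question.
    context = "\n\n".join(f"[Source: {c['url']}]\n{c['text']}" for c in chunks)
    return (
        f"{task}\n\n"
        f"Use the following context to answer:\n\n{context}\n\n"
        f"Question: {question}"
    )

# Hypothetical retrieved chunks; real runs would fetch five of them
# from the knowledge bank based on vector similarity to the question.
chunks = [
    {"url": "https://example.com/kb/ml-assertions",
     "text": "ML assertions are checks run against model predictions..."},
    {"url": "https://example.com/kb/prediction-overrides",
     "text": "Overrides force or forbid a prediction outcome when..."},
]
prompt = build_augmented_prompt(
    "You're an expert in Dataiku.",
    "How do ML assertions and overrides differ?",
    chunks,
)
sources = [c["url"] for c in chunks]  # surfaced as the Sources section
```

The LLM itself never searches anything: it only sees the augmented prompt, and the Sources section is assembled from the metadata column (here, url) attached to each retrieved chunk.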
Now that you know how to augment an LLM with your specific knowledge, you could create a dataset of questions to use as test cases in a Prompt Studio, then create a Prompt recipe from it.
For more information:
On the Embed recipe and the RAG approach, see the Concept | Retrieval Augmented Generation (RAG) approach and the Embed recipe article.
On prompt engineering, see the Tutorial | Prompt engineering with LLMs article.