Concept | Retrieval Augmented Generation (RAG) approach and the Embed recipe#

While large language models (LLMs) are powerful tools, they often lack knowledge that is internal and specific to your organization. The Embed recipe in Dataiku uses the Retrieval Augmented Generation (RAG) approach to fetch relevant pieces of text from a knowledge bank and enrich user prompts with them. This improves the precision and relevance of the answers returned by the LLMs.

As you can see in the diagram below, a benefit of the RAG approach is that you do not have to fine-tune a model to gain precision, which is time-consuming and expensive. You only have to add your internal knowledge to a knowledge bank and feed it to the LLM.

Diagram showing the four levels of LLM customization with Dataiku.

Prerequisites#

To use the Embed text recipe, you’ll need:

  • A Dataiku instance (version 12.3 and above). Dataiku Cloud is compatible.

  • A compatible code environment for retrieval-augmented models. This environment must be created beforehand by an administrator and include the Retrieval Augmented Generation models package.

    Screenshot of a code environment with the Retrieval Augmented Generation models package.
  • A connection to a supported embedding model, which will be used for text embedding in the Embed recipe. Note that in 12.3, only OpenAI is supported for embedding models.

  • A connection to a supported generative AI model, which is the model that will be augmented. See LLM connections for details.

RAG pipeline for question-answering using an augmented LLM#

Let’s see how the RAG approach works.

Diagram of the RAG pipeline in a question-answering application.

As you can see in the diagram above, using an augmented LLM includes several steps:

  1. Gather a corpus of documents that will serve as the bespoke information you’ll augment an LLM’s base knowledge with.

    For instance, these might be internal policy or financial documents, technical documentation, or research papers about a certain topic.

    Note

    If your textual data is in PDF, HTML, or another document format, you can first apply a text extraction recipe to transform the files into a tabular dataset with each document’s extracted text captured in its own row.

  2. The Embed recipe breaks your textual data into smaller chunks and vectorizes them with the embedding model (i.e. encodes the semantic information into a numerical representation). The resulting vectors are stored in a vector store such as FAISS, Pinecone, or ChromaDB. The output of the Embed recipe is a knowledge bank that is optimized for high-dimensional semantic search.

    Note

    The numeric vectors are commonly known as text embeddings, hence the recipe’s name.

  3. When a user asks a question, the retriever searches the vector store for the chunks most similar to the question and augments the question with this relevant information.

  4. The augmented prompt is then used to query the large language model, which can return more precise answers along with their sources (steps 2 to 4 are illustrated in the sketch below).
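To make steps 2 to 4 more concrete, here is a minimal, illustrative Python sketch of the same pipeline outside of Dataiku. The embed() function is only a placeholder for a real embedding model (such as the OpenAI embedding endpoint configured in your LLM connection), and the retrieval step is a plain cosine-similarity search, which is what a vector store performs at scale. This is a sketch of the idea, not Dataiku's implementation.

```python
import numpy as np

# Placeholder embedding function. With a real model, semantically similar
# texts receive similar vectors; here we only mimic the shape of the output.
def embed(text):
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=384)             # toy 384-dimensional embedding
    return v / np.linalg.norm(v)         # normalize for cosine similarity

# 1. Corpus of document chunks (the content of the "knowledge bank").
chunks = [
    "Employees may work remotely up to three days per week.",
    "Expense reports must be submitted within 30 days.",
    "The security team reviews all third-party integrations.",
]
chunk_vectors = np.stack([embed(c) for c in chunks])   # what the Embed recipe produces

# 2. Embed the user question and retrieve the closest chunks.
question = "How many remote days are allowed?"
q_vec = embed(question)
scores = chunk_vectors @ q_vec                         # cosine similarity (vectors are normalized)
top_ids = np.argsort(scores)[::-1][:2]                 # keep the 2 most similar chunks

# 3. Build the augmented prompt that is sent to the LLM.
context = "\n".join(chunks[i] for i in top_ids)
augmented_prompt = (
    "Answer the question using only the context below, and cite your sources.\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)
print(augmented_prompt)   # this enriched prompt is what the augmented LLM receives
```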

Embed recipe settings#

As its name suggests, the Embed recipe handles the text embedding part (i.e. text vectorization) of the RAG approach.

The recipe settings page is where you:

  1. Indicate the column from the input dataset that contains the text to embed, or convert from textual data into numerical vectors.

  2. Configure the splitting method and chunk settings for the input data (see the splitting sketch below).

  3. Set up your knowledge bank by selecting the embedding model and vector store to use for text embedding.

Screenshot of the settings page of an Embed recipe.
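The splitting settings determine how each input document is cut into chunks before embedding. The helper below is a hypothetical illustration of a simple character-based splitter with overlap, not Dataiku's actual implementation; the recipe exposes equivalent parameters in its settings.

```python
def split_into_chunks(text, chunk_size=500, overlap=50):
    """Naive character-based splitter: fixed-size windows with some overlap,
    so text cut at a chunk boundary still appears intact in one chunk."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
    return chunks

document = "Our travel policy applies to all employees. " * 40
print(len(split_into_chunks(document)))   # number of chunks produced for this document
```

Smaller chunks give more precise retrieval but less context per chunk; the overlap avoids losing sentences that straddle a boundary.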

Knowledge bank settings#

The knowledge bank is the output of the Embed recipe. This is where your input textual data has been converted into numerical vectors to augment the LLM you want to use.

On the Flow, it is represented as a pink square object.

The sections below describe the different tabs used to configure the knowledge bank.

Use

This tab is where you configure the LLMs that will be augmented with the content of the knowledge bank.

It allows you to:

  • Select the LLMs to augment using the knowledge bank. The chosen LLMs will then have a new version available in the Prompt Studio and Prompt recipe, under the Retrieval augmented section of the list of available LLMs.

  • Indicate the number of documents (i.e. chunks from the input Knowledge column) to send to the LLM. These documents are retrieved based on their similarity to the text in the LLM query.

  • Configure diversity settings to ensure the retrieval of diverse documents, which is particularly beneficial if the knowledge bank contains duplicate entries.

  • Enable the Print sources option so that the LLM exposes the metadata of the documents used to generate the response when answering a prompt. To use this option, you must first set a metadata column in the Embed recipe settings to provide the information to the final users.

  • Define the output format: plain text or JSON according to your needs.

Screenshot of the Use tab of a knowledge bank.

In the example above, you are augmenting the ChatGPT 3.5 LLM and asking that, out of the 20 documents closest to the query, the top five be used by the LLM to build an answer in plain text. Because the Print sources option is enabled, when you test it in the Prompt Studio, the answer will list the five sources used to generate it.
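The diversity setting in the example above can be thought of as a two-stage retrieval: first fetch the 20 chunks closest to the query, then pick the five that are both relevant and dissimilar from each other. The sketch below shows one common way to do this, maximal marginal relevance (MMR); Dataiku's exact selection logic may differ, so treat this as an illustration of the idea rather than the actual implementation.

```python
import numpy as np

def mmr_select(query_vec, candidate_vecs, k=5, lambda_=0.7):
    """Pick k candidates balancing relevance to the query (first term)
    against redundancy with already-selected candidates (second term)."""
    selected = []
    remaining = list(range(len(candidate_vecs)))
    relevance = candidate_vecs @ query_vec                 # similarity to the query
    while remaining and len(selected) < k:
        if not selected:
            best = max(remaining, key=lambda i: relevance[i])
        else:
            chosen = candidate_vecs[selected]
            redundancy = (candidate_vecs[remaining] @ chosen.T).max(axis=1)
            scores = lambda_ * relevance[remaining] - (1 - lambda_) * redundancy
            best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected

# 20 normalized candidate vectors (the chunks closest to the query) and a query vector.
rng = np.random.default_rng(0)
candidates = rng.normal(size=(20, 384))
candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)
query = candidates[0] + 0.1 * rng.normal(size=384)
query /= np.linalg.norm(query)

print(mmr_select(query, candidates, k=5))   # indices of the 5 chunks sent to the LLM
```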

Core settings

This tab allows you to edit the embedding method and indicate the vector store used to store the vector representation of the textual data.

By default, the Embed recipe uses the FAISS vector store.

Screenshot of the Core settings tab of a knowledge bank.
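FAISS, the default vector store, keeps the embeddings in an index that supports fast nearest-neighbor search. Below is a minimal standalone sketch of that idea using the faiss library directly (assuming faiss-cpu and numpy are installed and using random vectors in place of real embeddings); the Embed recipe manages the equivalent index for you inside the knowledge bank.

```python
import faiss
import numpy as np

dim = 384                                                  # embedding dimension
embeddings = np.random.rand(100, dim).astype("float32")    # 100 chunk embeddings (random stand-ins)

index = faiss.IndexFlatL2(dim)                             # exact L2 nearest-neighbor index
index.add(embeddings)                                      # store the chunk vectors

query = np.random.rand(1, dim).astype("float32")           # embedded user question
distances, ids = index.search(query, 5)                    # 5 closest chunks to the query
print(ids[0])                                              # row indices of the retrieved chunks
```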

Embedding settings

This tab is a snapshot of the parent recipe settings upon the last run.

Screenshot of the Embedding settings tab of a knowledge bank.

Flow settings

This tab allows you to define whether you want to automatically recompute the knowledge bank when building a downstream dataset in the Flow.

Screenshot of the Flow settings tab of a knowledge bank.

Testing in the Prompt Studio#

Once you’ve augmented an LLM with the content of your knowledge bank, this LLM is available in the Prompt Studios and the Prompt recipe, under a section named Retrieval augmented.

Note

Retrieval-augmented LLMs are only available within the scope of the project and inherit the connection properties of the underlying LLM (caching, filters, permissions, etc.).

You can use a Prompt Studio to test your augmented model with some prompts and evaluate the responses.

Screenshot of a test case in a Prompt Studio.

Note that the LLM indicates its sources at the bottom of the response. This is a key capability and an advantage of the RAG approach.

As you may know, generic LLMs can sometimes offer answers that sound plausible but are, in fact, hallucinations: statements that seem true but aren't backed up by real data. This is a major risk when accurate and credible information is not just a nice-to-have but a must-have. Because the documents from your own knowledge bank are clearly displayed with each answer, the application delivers insights that your teams can verify and trust.

What’s next?#

Continue learning about the Embed recipe and the RAG approach by working through the Tutorial | Use the Retrieval Augmented Generation (RAG) approach for question-answering article.