Tutorial | Manage document-level permissions for RAG

Get started

When you use RAG enterprise-wide, you might need to provide access to specific documents for certain groups of people and restrict it for others. In that case, if a user tries to run a query without the required permissions, the LLM will ignore restricted data and return a general response instead. When using a SharePoint folder as the data source, Dataiku makes this possible with the List Access recipe, which retrieves permissions from Azure user groups and assigns the appropriate access rights to users in Dataiku. In this tutorial, you’ll see exactly how to prepare and set up the recipe to enable secure RAG access.

Objectives

In this tutorial, you will:

  • Create a List Access recipe from your SharePoint folder.

  • Populate a knowledge bank with embedded access.

  • Use a retrieval-augmented (RA) model or an agent with a search tool in a chatbot interface.

Prerequisites

To complete this tutorial, you’ll need:

Create the project

  1. From the Dataiku Design homepage, click + New Project.

  2. Select Learning projects.

  3. Search for and select Manage Document-Level Permissions for RAG.

  4. If needed, change the folder into which the project will be installed, and click Create.

  5. From the project homepage, click Go to Flow (or type g + f).

Note

You can also download the starter project from this website and import it as a zip file.

Create the list of permissions

Starting from an empty Flow, the first step is to create a folder that holds the files from your SharePoint folder.

Create the folder

  1. From the Flow, click + Add Item > Connect or create > Folder.

  2. Name it sharepoint_files.

  3. For Store into, select a SharePoint Online connection.

  4. Click Create.

  5. Navigate to the Settings tab.

  6. Enter the settings for the SharePoint folder that contains your files.

  7. Click Save.

Dataiku screenshot of the sharepoint folder creation.

List access

The List Access recipe lists the files inside a folder along with the groups that have access to each one. More precisely, it lists the Entra groups from SharePoint and maps them to Dataiku user groups. This lets you add a layer of document-level security based on security tokens.

  1. Select the created sharepoint_files folder.

  2. In the Actions panel, under Visual Recipe, click List Access.

  3. Name the output dataset list_access_dataset.

  4. Click Create Recipe.

  5. Click Run.

Dataiku screenshot of list access recipe creation.

Alternatively, if your instance doesn’t have a mapping with Azure AD groups, you can still provide a manual mapping dataset as an input. You need a dataset with two columns: DSS groups (string) and Entra groups (JSON array). For example:

DSS groups     Entra groups
Dss_group_1    ["Entra_group-1", "Other_entra_group"]
Dss_group_2    ["Entra_group-2"]

The List Access recipe then uses the manual mapping you provided to determine which Dataiku group has access to which file.
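To make the manual mapping concrete, here is a minimal Python sketch of how a mapping like the one above could be inverted and used to resolve the Entra groups on a file into DSS groups. It only illustrates the idea and is not the recipe's implementation; the function names and the example file's groups are hypothetical.

```python
import json

# Rows mirroring the two-column manual mapping dataset shown above.
# The "Entra groups" cell is a JSON array stored as a string.
manual_mapping = [
    {"DSS groups": "Dss_group_1", "Entra groups": '["Entra_group-1", "Other_entra_group"]'},
    {"DSS groups": "Dss_group_2", "Entra groups": '["Entra_group-2"]'},
]


def invert_mapping(rows):
    """Return {entra_group: sorted list of DSS groups} from the mapping rows."""
    entra_to_dss = {}
    for row in rows:
        for entra_group in json.loads(row["Entra groups"]):
            entra_to_dss.setdefault(entra_group, set()).add(row["DSS groups"])
    return {entra: sorted(dss) for entra, dss in entra_to_dss.items()}


def dss_groups_for_file(file_entra_groups, entra_to_dss):
    """Resolve the Entra groups on a file to the DSS groups allowed to read it."""
    allowed = set()
    for entra_group in file_entra_groups:
        # Entra groups without a mapping grant no access in this sketch.
        allowed.update(entra_to_dss.get(entra_group, []))
    return sorted(allowed)


entra_to_dss = invert_mapping(manual_mapping)
# Hypothetical file shared with one mapped and one unmapped Entra group.
print(dss_groups_for_file(["Entra_group-2", "Unmapped_group"], entra_to_dss))
# prints ['Dss_group_2']
```

Note that the Entra groups cell must be valid JSON, so the group names need straight double quotes rather than typographic ones.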

Embed your documents with security permissions

  1. Select both the list_access_dataset and sharepoint_files folder.

  2. In the Actions panel, click Embed documents.

  3. Enter folder_access_embedding as the output name.

  4. Select the model and the vector store of your choice.

  5. Click Create Recipe.

Dataiku screenshot of embed documents recipe creation.

Here, the goal is to embed the documents while integrating permissions as metadata. You can choose from pre-built extraction options, such as text or visual, which support specific file types. However, in this case, you’ll want something custom to cover everything.

  1. Select Custom Rules.

  2. Click Add New Rule.

  3. Leave the condition set to file extension and Equals (case sensitive), and enter txt as the file extension to match.

  4. Click + Add A Condition and add the following file extensions to be retained:

    • md

    • pdf

    • pptx

    • docx

    • html

  5. Navigate to Metadata Dataset.

  6. Select your Security tokens column from the list of columns.

  7. Click Run.

Dataiku screenshot of the embed recipe custom rules.
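The custom rule above amounts to a case-sensitive match on the file extension. The following Python sketch illustrates that behavior on a few hypothetical file paths. It is an assumption about what the rule does from a user's point of view, not a description of how Dataiku evaluates it internally.

```python
from pathlib import PurePosixPath

# Extensions retained by the custom rule configured in the steps above.
RETAINED_EXTENSIONS = {"txt", "md", "pdf", "pptx", "docx", "html"}


def is_retained(path):
    """Case-sensitive match of the file extension against the retained set."""
    suffix = PurePosixPath(path).suffix  # ".pdf" for "a/b.pdf", "" if no extension
    return suffix[1:] in RETAINED_EXTENSIONS


# Hypothetical folder content.
files = ["hr/policy.pdf", "notes/README.md", "scans/photo.PNG", "reports/Q3.PDF", "LICENSE"]
retained = [f for f in files if is_retained(f)]
print(retained)
```

Because the comparison is case sensitive, a file such as reports/Q3.PDF would not match the pdf condition in this sketch. If your SharePoint folder mixes upper- and lowercase extensions, check which comparison operator fits your files.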

Note

If you start from a dataset containing document chunks and security tokens rather than a folder, use the Embed dataset recipe. The security token option is available under the recipe’s Advanced settings.
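As an illustration of what such a dataset could look like, the sketch below splits a document into chunks and attaches the parent document's security tokens to every chunk. The column names (path, chunk, security_tokens), the fixed-size chunking, and the JSON-array encoding of the tokens are assumptions made for this example.

```python
import json


def chunk_text(text, size):
    """Split text into consecutive chunks of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def rows_with_tokens(doc_path, text, tokens, size=40):
    """Build one row per chunk, each carrying the parent document's tokens."""
    return [
        {"path": doc_path, "chunk": chunk, "security_tokens": json.dumps(tokens)}
        for chunk in chunk_text(text, size)
    ]


# Hypothetical 100-character document readable by Dss_group_1 only.
rows = rows_with_tokens("hr/policy.txt", "x" * 100, ["Dss_group_1"])
for row in rows:
    print(row["path"], len(row["chunk"]), row["security_tokens"])
```

The point of the sketch is that every chunk inherits the permissions of the document it came from, so a restricted document stays restricted after chunking.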

You have now created a knowledge bank populated with your security tokens. The next step is to set up a way to retrieve them.

Retrieve your security tokens

In Dataiku, there are two ways to retrieve stored security tokens and use them in a webapp or an application:

  • RA model

  • Agent

Use agents if your goal is to build a more comprehensive system that integrates multiple tools. For simple retrieval-augmented generation (RAG) tasks, the RA model is more suitable.

For your chosen solution:

  1. From the Flow, click the folder_access_embedding knowledge bank.

  2. In the Actions panel, under Knowledge Bank, click Retrieval Augmented.

  3. Select your LLM (GPT-4o in the example).

  4. Name it security_token_retrieval.

  5. Click OK to confirm.

The RA model is now created and linked to your knowledge bank. Next, configure its LLM and security settings.

  1. From the RAG menu, select the LLM of your choice.

  2. Check Enforce document-level security.

  3. Enable Smart mode. The RA model then searches the knowledge bank only when needed, which keeps the chatbot conversation more fluid.

  4. Click Save.

Dataiku screenshot of the agent tool creation.

You can explore the other available settings to tailor the RA model to your needs.
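Conceptually, enforcing document-level security means that a chunk is only eligible for retrieval when the caller holds at least one of the security tokens stored with it. The sketch below illustrates this with plain Python. The token strings, the knowledge bank content, and the post-retrieval filtering are assumptions for the example, and the actual filtering may happen inside the vector store query instead.

```python
def visible_chunks(chunks, caller_tokens):
    """Keep only chunks that share at least one security token with the caller."""
    caller = set(caller_tokens)
    return [chunk for chunk in chunks if caller & set(chunk["security_tokens"])]


# Hypothetical knowledge bank content with per-chunk security tokens.
knowledge_bank = [
    {"text": "Salary bands for 2024", "security_tokens": ["Dss_group_1"]},
    {"text": "Office opening hours", "security_tokens": ["Dss_group_1", "Dss_group_2"]},
]

# A caller in Dss_group_2 only sees the chunk shared with that group.
for chunk in visible_chunks(knowledge_bank, ["Dss_group_2"]):
    print(chunk["text"])

# A caller with no matching token gets no chunks back.
print(visible_chunks(knowledge_bank, ["Some_other_group"]))
```

When no chunk is visible, the LLM has no restricted context to draw on, which is why it falls back to a general response.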

Use your secured knowledge bank with a webapp

Now that your pipeline is set up, you can securely use a webapp with a chatbot to interact with the data, provided you belong to the group of users who have the correct permissions. If you don’t, the chatbot will still respond, but the answer won’t include the restricted information. You can do this using Agent Connect, a webapp that makes it easy to chat with and manage all your agents in one place.

Create the webapp

  1. From the Code menu in the top navigation bar, select Webapp.

  2. Click + New Webapp and select Visual Webapp.

  3. Select Agent Connect.

  4. Enter Secured Chat Bot as the Webapp name.

  5. Click Create.

Configure Agent Connect

Next, configure the webapp by specifying settings such as the LLM and the chat history storage location.

  1. In Main LLM, select the LLM of your choice.

  2. For Conversation History Dataset, select New Dataset. Name it history and click Create Dataset.

    Important

    Both the history and user profile datasets must be stored in a SQL-compatible database.

  3. Do the same for the User profile dataset and name it user_profile.

  4. In Projects, add your project.

  5. In Retrieval-Augmented LLMs, add the security_token_retrieval model you created.

    Note

    RA models must be in smart mode to be used in Agent Connect.

  6. Click + Add an object for the RAG descriptions. Select your security_token_retrieval model and provide the appropriate description.

  7. Click Save and navigate to the View tab.

Agent Connect is now fully configured and connected to the RA model responsible for securely retrieving restricted documents based on the SharePoint user group permissions.

Next steps

You’ve now enabled document-level security on your RAG-powered chatbot. Explore more Generative AI topics in Dataiku, such as Tutorial | Build a multimodal knowledge bank for a RAG project.