NLP and LLMs
Learn how to unlock text insights in Dataiku through natural language processing (NLP) and Generative AI technology, specifically large language models (LLMs).
Start here and choose what works for you — whether that’s using native visual features, custom code, or plugins.
Tip
Validate your knowledge of this area by registering for the Dataiku Academy course, NLP - The Visual Way. Then challenge yourself to earn a certification!
Tutorials
- Tutorial | Getting started with natural language processing (Visual NLP part 1)
- Tutorial | Cleaning text data (Visual NLP part 2)
- Tutorial | Handling text features for machine learning (Visual NLP part 3)
- Tutorial | Deep learning for sentiment analysis
- Tutorial | Gutenberg plugin for author style recognition
- Tutorial | Sentiment analysis plugin
How-to | Use spaCy models in Dataiku
To use spaCy models in Dataiku, start by installing it like any other Python package in Dataiku:
Create a code environment, and add “spacy” to your package requirements.
Note
To do so, follow the reference documentation on managing Python packages.
Be aware that some functionalities of spaCy, such as language-specific tokenizers, rely on models that are not bundled in the library itself. To use these models, you need an additional download step. Typically, this can create issues on shared Dataiku nodes where users do not have write access to shared locations on the server (see User Isolation Framework).
To overcome this challenge, you can use spaCy's dedicated pip delivery mechanism, which lets you install a model like any other package.
For instance, in your code environment's package requirements, add:
https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz

After adding this link, rebuild your code environment.
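The link above follows a regular naming pattern: the model name and version appear twice, once in the release tag and once in the archive file name. As a minimal sketch, the hypothetical helper below (`spacy_model_url` is not part of spaCy or Dataiku) rebuilds that link from a model name and a version, assuming other releases follow the same scheme as the one shown. Keep in mind that a model version generally needs to be compatible with the spaCy version installed in the code environment.

```python
# Hypothetical helper: not part of spaCy or Dataiku.
# Assumption: other model releases follow the same URL scheme as the link shown above.
BASE_URL = "https://github.com/explosion/spacy-models/releases/download"


def spacy_model_url(name: str, version: str) -> str:
    """Build the pip-installable archive URL for a spaCy model release."""
    release = f"{name}-{version}"
    return f"{BASE_URL}/{release}/{release}.tar.gz"


# Reproduces the requirement line used in this how-to
print(spacy_model_url("en_core_web_sm", "2.2.5"))
```

The printed line can be pasted into the package requirements of the code environment.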
To test that it works correctly, run the following code in a notebook using this code environment.
import spacy
nlp = spacy.load("en_core_web_sm")
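For a slightly more informative check, the sketch below first verifies that both spaCy and the model package are importable, then tokenizes a sample sentence. It relies on the fact that pip-installed spaCy models are regular Python packages; the helper names `is_installed` and `smoke_test` and the sample sentence are illustrative assumptions, not part of spaCy or Dataiku.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package is importable in this environment."""
    return importlib.util.find_spec(package_name) is not None


def smoke_test(model_name: str = "en_core_web_sm"):
    """Load the model and tokenize a sentence.

    Returns the list of token texts, or None if spaCy or the model is missing.
    """
    if not (is_installed("spacy") and is_installed(model_name)):
        return None
    import spacy  # imported lazily, only once we know it is available

    nlp = spacy.load(model_name)
    return [token.text for token in nlp("Dataiku makes text analysis easier.")]


print(smoke_test())
```

If this prints `None`, the notebook is probably not running on the rebuilt code environment, or the model line is missing from its requirements.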
How-to | Use the Python Natural Language Toolkit (NLTK) in Dataiku
To use the NLTK (Natural Language Toolkit) in Dataiku, start by installing it like any other Python package in Dataiku:
Create a code environment, and add “nltk” to your package requirements.
Note
To do so, follow the reference documentation on managing Python packages.
Be aware that some functionalities of NLTK, such as text corpora and language-specific models, rely on resources that are not bundled in the library itself. The full list of resources is available on the NLTK project site.
To use these resources, follow one of the three download procedures below. Typically, downloading can create issues on shared Dataiku nodes where users do not have write access to shared locations on the server (see the User Isolation Framework documentation for details).
Download NLTK Data for your system user
Warning
This code might not work for other users if your Dataiku node is configured with the User Isolation Framework and they have not run the download themselves.
Open a Python notebook running locally with the code environment you built that contains nltk:
import nltk
# This downloads all NLTK data to your $HOME/nltk_data directory
nltk.download("all")
Download NLTK Data for all users in a shared temp location
If you wish to share the downloaded packages with many system users, you can choose a custom location accessible to every user running nltk. Some locations make the data available without extra effort; they are all listed in the nltk.data.path Python property.
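To see which locations NLTK searches on your node, and whether the current user can write to them, a minimal sketch such as the following can be run in a notebook. The helper name `nltk_search_paths` is an assumption made for illustration; it only reads `nltk.data.path` and returns an empty list if nltk is not installed.

```python
import os


def nltk_search_paths():
    """Return (path, exists, writable) for each directory NLTK searches.

    Returns an empty list when nltk is not installed in the current environment.
    """
    try:
        import nltk
    except ImportError:
        return []
    return [
        (path, os.path.isdir(path), os.access(path, os.W_OK))
        for path in nltk.data.path
    ]


for path, exists, writable in nltk_search_paths():
    print(f"{path}  exists={exists}  writable={writable}")
```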
This section uses /tmp/nltk as the download location.
Start a Python notebook with local execution and the code environment that contains nltk, and run the following code:
import nltk
# This downloads all NLTK data to /tmp/nltk
nltk.download('all', download_dir='/tmp/nltk')
Once this is done, every recipe or notebook in your project that uses nltk must add this location to the NLTK search path:
import nltk
nltk.data.path.append('/tmp/nltk')
You can also initialize your recipes or notebooks with a conditional download, in case the directory is cleared from time to time:
import os
import nltk

nltk.data.path.append('/tmp/nltk')
# Download again only if the directory has been cleared
if not os.path.exists('/tmp/nltk'):
    nltk.download('all', download_dir='/tmp/nltk')
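Checking only that the directory exists can miss the case where it was recreated empty. As one possible variation, the sketch below also treats an empty directory as missing. The helper `ensure_nltk_data` and its `downloader` parameter are hypothetical names introduced here; the parameter exists so the logic can be exercised without network access, and by default it calls `nltk.download` as in the snippets above.

```python
import os


def _download_all(target: str) -> None:
    """Default downloader: fetch all NLTK data into the target directory."""
    import nltk  # imported lazily so the helper itself does not require nltk

    nltk.download("all", download_dir=target)


def ensure_nltk_data(download_dir: str, downloader=_download_all) -> bool:
    """Trigger a download only when download_dir is missing or empty.

    Returns True if a download was triggered, False if data was already present.
    """
    if os.path.isdir(download_dir) and os.listdir(download_dir):
        return False  # something is already there, nothing to do
    os.makedirs(download_dir, exist_ok=True)
    downloader(download_dir)
    return True
```

Calling `ensure_nltk_data('/tmp/nltk')` at the top of a recipe or notebook would replace the `if` block; the location still has to be appended to `nltk.data.path`. Note that a non-empty directory is not proof of a complete download, so this remains a heuristic.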
You can also ask your Dataiku administrator to set the NLTK_DATA environment variable in your DATA_DIR/bin/env-site.sh as follows:
# add this at the end of your env-site.sh file
export NLTK_DATA=/tmp/nltk
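To check from a notebook or recipe whether the variable is visible to your Python process, a small sketch like the one below can help. It assumes that NLTK_DATA may hold several directories separated by the platform path separator, and the helper name `nltk_data_dirs` is hypothetical.

```python
import os


def nltk_data_dirs(environ=os.environ):
    """Return the directories listed in NLTK_DATA, in order, skipping empty entries."""
    raw = environ.get("NLTK_DATA", "")
    return [d for d in raw.split(os.pathsep) if d]


print(nltk_data_dirs() or "NLTK_DATA is not set for this process")
```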
Download NLTK Data for all users (not recommended).
Warning
This procedure requires command-line access and administrative privileges on the machine hosting Dataiku. You may need to involve your Dataiku admin and/or Linux admin, since later maintenance and upgrades are likely to require their help as well.
Assuming you are on a Linux machine and have administrative privileges, run:
pip install nltk
sudo python -m nltk.downloader -d /usr/share/nltk_data all
For macOS, the path is slightly different: /usr/local/share/nltk_data
To test that it worked correctly, run the following code in a notebook using your code environment with nltk.
from nltk.corpus import brown
print(brown.words())
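If the test above fails, it can be useful to distinguish a missing resource from a missing library. The sketch below does this with `nltk.data.find`, which raises `LookupError` when a resource is not found in any search location. The helper name `check_nltk_resource` and its return strings are illustrative assumptions.

```python
def check_nltk_resource(resource: str = "corpora/brown") -> str:
    """Return 'ok', 'missing', or 'nltk not installed' for an NLTK resource."""
    try:
        import nltk
    except ImportError:
        return "nltk not installed"
    try:
        nltk.data.find(resource)
    except LookupError:
        return "missing"
    return "ok"


print(check_nltk_resource())
```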
Note
For further details, please refer to this NLTK documentation.
How-to | Use the OpenAI GPT plugin in Dataiku
The OpenAI GPT plugin provides four visual recipes to perform text generation, text classification (zero and few shot), question answering, and text summarization using OpenAI GPT.
To understand how to set up the plugin on your Dataiku instance, check out the How to set up section of our OpenAI GPT plugin page.
To learn how to use the four visual recipes that come with the plugin, follow the How to use section of our OpenAI GPT plugin page.