How to Use the Python Natural Language Toolkit (NLTK) in Dataiku

Greetings fellow Linguists,

You can start by installing NLTK (Natural Language Toolkit) like any other Python package in DSS: create a code environment and add “nltk” to its package requirements. To do so, follow this section of the Dataiku DSS documentation.

However, some NLTK features, such as text corpora and language-specific models, rely on resources that are not bundled with the library itself. The full list of resources is available on the NLTK project site.

To use these resources, you need an additional download step. This step often causes issues on shared DSS nodes, where users do not have write access to shared locations on the server (see the User Isolation Framework documentation for details).
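To see which candidate locations the current system user can actually write to, something like the following sketch may help. It uses only the standard library; the helper name `writable_dirs` and the candidate list (the paths used in this article) are illustrative assumptions, not part of NLTK or DSS. Note that `os.access` is only a hint and does not guarantee that a later write will succeed.

```python
import os


def writable_dirs(candidates):
    """Return the candidate directories the current user could likely write into."""
    result = []
    for path in candidates:
        # The target may not exist yet, so walk up to the nearest existing ancestor.
        probe = os.path.abspath(path)
        while not os.path.exists(probe):
            parent = os.path.dirname(probe)
            if parent == probe:
                break
            probe = parent
        # Creating files or subdirectories needs write and execute permission.
        if os.path.isdir(probe) and os.access(probe, os.W_OK | os.X_OK):
            result.append(path)
    return result


# Hypothetical candidate list, based on the locations discussed in this article.
candidates = [
    os.path.join(os.path.expanduser("~"), "nltk_data"),
    "/usr/share/nltk_data",
    "/tmp/nltk",
]
print(writable_dirs(candidates))
```

Running this from a notebook that uses your code environment shows the situation for the system user that actually executes your code, which may differ from your login user when the User Isolation Framework is enabled.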

Download NLTK Data for your system user

Warning

If your DSS node is configured with the User Isolation Framework, this code makes the data available only to your own system user. Other users who have not run the download themselves will not be able to load it.

Open a Python notebook running locally with the code environment you built that contains nltk:

import nltk
# This downloads the NLTK data into your $HOME/nltk_data directory
nltk.download("all")

Download NLTK Data for all users in a shared temp location

If you wish to share the downloaded packages with many system users, you can choose a custom location accessible to every user running nltk. NLTK searches some locations by default, so data placed there is found without extra configuration; these locations are all listed in nltk.data.path.

In this example we will be using /tmp/nltk as our download location.

You need to start a Python notebook with local execution and the code environment that contains nltk, and run the following code:

import nltk
# This downloads the NLTK data into /tmp/nltk
nltk.download('all', download_dir='/tmp/nltk')

Once this is done, every recipe or notebook in your project that uses nltk must add this location to the NLTK search path:

import nltk
nltk.data.path.append('/tmp/nltk')

You can also initialize your recipes or notebooks with a conditional download, in case the directory is cleared from time to time:

import os
import nltk

nltk.data.path.append('/tmp/nltk')
# Re-download only if the shared directory has been removed
if not os.path.exists('/tmp/nltk'):
    nltk.download('all', download_dir='/tmp/nltk')
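Checking only that the directory exists does not tell you whether a particular resource is present, and downloading "all" can take a long time. As a hedged alternative, the sketch below checks for a single resource with `nltk.data.find` and downloads just that package if the lookup fails. The helper name `ensure_nltk_resource` and its parameters are assumptions made for illustration; `/tmp/nltk` is the example location used above.

```python
def ensure_nltk_resource(resource_path, package_id, data_dir="/tmp/nltk"):
    """Make sure one NLTK resource is available, downloading it if needed.

    resource_path: lookup path understood by nltk.data.find, e.g. "corpora/brown"
    package_id:    identifier passed to nltk.download, e.g. "brown"
    data_dir:      shared download location (assumed to be /tmp/nltk here)
    """
    # Imported inside the function so this helper can be defined
    # even before the code environment with nltk is selected.
    import nltk

    # Register the shared location once, avoiding duplicate entries.
    if data_dir not in nltk.data.path:
        nltk.data.path.append(data_dir)

    try:
        # Raises LookupError when the resource is not on the search path.
        nltk.data.find(resource_path)
        return True
    except LookupError:
        # Download only the missing package; returns True on success.
        return nltk.download(package_id, download_dir=data_dir, quiet=True)
```

In a recipe or notebook, this might be called as `ensure_nltk_resource("corpora/brown", "brown")` before importing the Brown corpus.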

You can also ask your Dataiku administrator to set the NLTK_DATA environment variable in your DATA_DIR/bin/env-site.sh as follows:

# add this at the end of your env-site.sh file
export NLTK_DATA=/tmp/nltk
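To confirm that the variable is actually visible to the Python process running your notebook or recipe, a small standard-library check such as the following can be used. The helper name `nltk_data_dirs_from_env` is hypothetical; the sketch assumes that NLTK_DATA may hold several directories separated by the platform's path separator, which is how NLTK is understood to read it at import time.

```python
import os


def nltk_data_dirs_from_env(environ=None):
    """Return the directories listed in NLTK_DATA, or an empty list if unset."""
    environ = os.environ if environ is None else environ
    raw = environ.get("NLTK_DATA", "")
    # Split on the platform path separator and drop empty entries.
    return [entry for entry in raw.split(os.pathsep) if entry]


# Prints [] when the variable is not set for this process.
print(nltk_data_dirs_from_env())
```

If the list is empty after the administrator has edited env-site.sh, the process was probably started before the change took effect.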

Download NLTK Data for all users (not recommended)

Warning

This procedure needs command-line access and administrative privileges on the machine hosting DSS. You may need to speak to your DSS and/or Linux administrator, since maintenance and upgrades will likely require their involvement.

Assuming you are on a Linux machine and have administrative privileges, run:

pip install nltk
sudo python -m nltk.downloader -d /usr/share/nltk_data all

For macOS, the path is slightly different: /usr/local/share/nltk_data

To test that it worked correctly, run the following code in a notebook using your code environment with nltk.

from nltk.corpus import brown
print(brown.words())

For further details, please refer to this NLTK documentation.

Happy natural language processing!