How to use Natural Language Toolkit (NLTK) in DSS

Greetings fellow Linguists,

You can start by installing NLTK (Natural Language Toolkit) like any other Python package in DSS: create a code environment and add “nltk” to its package requirements. To do so, follow this section of the Dataiku DSS documentation.

However, some NLTK features, such as text corpora and language-specific models, rely on resources that are not bundled with the library itself. The full list of resources is published on the NLTK project site.

To use these resources, you need an additional download step. This step often causes problems on shared DSS nodes, where users do not have write access to shared locations on the server (see the User Isolation Framework documentation for details).

  1. Download NLTK Data for all users (recommended)

Warning

This procedure needs command-line access and administrative privileges on the machine hosting DSS. You may need to speak to your DSS admin and/or Linux admin.

Assuming you are on a Linux machine and have administrative privileges, run:

pip install nltk
sudo python -m nltk.downloader -d /usr/share/nltk_data all

On macOS, use the path /usr/local/share/nltk_data instead.

To test that it worked correctly, run the following code in a notebook using your code environment with nltk.

from nltk.corpus import brown
print(brown.words())

For further details, please refer to this NLTK documentation.
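If the test above fails, it helps to know whether the problem is the code environment or the data location. The following is a minimal sketch of such a check, not part of the official procedure: the helper name check_nltk_resource is hypothetical, and it relies only on nltk.data.find and nltk.data.path from NLTK itself.

```python
import importlib


def check_nltk_resource(resource="corpora/brown"):
    """Return a short status string for one NLTK resource (hypothetical helper)."""
    try:
        # Import lazily so a missing package is reported instead of crashing.
        nltk_data = importlib.import_module("nltk.data")
    except ImportError:
        return "nltk is not installed in this code environment"
    try:
        # find() raises LookupError when the resource is in none of the search paths.
        location = nltk_data.find(resource)
    except LookupError:
        searched = ", ".join(nltk_data.path)
        return f"{resource} not found; searched: {searched}"
    return f"{resource} found at {location}"


print(check_nltk_resource())
```

When the resource is reported as not found, the list of searched directories shows whether the shared data directory is among the locations NLTK looks in.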

  2. Download NLTK Data for yourself

Warning

Data downloaded this way will not be available to other users if your DSS node is configured with the User Isolation Framework.

Run this command without sudo, pointing to your Linux home directory:

python -m nltk.downloader -d /home/<yourLinuxUserName>/nltk_data all

In your Python code, you then need to set the “NLTK_DATA” environment variable before importing nltk, because NLTK reads this variable when the library is first imported.

import os
# Set this before importing nltk; replace the placeholder with your Linux user name.
os.environ['NLTK_DATA'] = '/home/<yourLinuxUserName>/nltk_data'
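In a notebook, nltk may already have been imported by an earlier cell, in which case changing the environment variable alone has no effect on the running session. The sketch below handles both cases. The helper name configure_nltk_data is hypothetical, and the "~/nltk_data" location is an assumption that only holds if the code runs as the same Linux user who downloaded the data.

```python
import os
import sys


def configure_nltk_data(data_dir):
    """Point NLTK at data_dir, whether or not nltk is already imported."""
    data_dir = os.path.expanduser(data_dir)
    # Picked up by NLTK the next time the library is imported.
    os.environ["NLTK_DATA"] = data_dir
    # If nltk was imported earlier in this session, update its search path directly.
    if "nltk" in sys.modules:
        import nltk.data
        if data_dir not in nltk.data.path:
            nltk.data.path.insert(0, data_dir)
    return data_dir


# Assumed location: the per-user directory used in the download command above.
configure_nltk_data("~/nltk_data")
```

Calling the helper at the top of a recipe or notebook keeps the path handling in one place instead of repeating it before every NLTK call.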

Happy natural language processing!