How-to | Use spaCy models in Dataiku

To use spaCy models in Dataiku, start by installing spaCy like any other Python package:

  1. Create a code environment, and add spacy to your package requirements.

    Note

    To do so, follow the reference documentation on managing Python packages.

    Be aware that some spaCy features, such as language-specific tokenizers, rely on models that are not bundled with the library itself. Using these models requires an additional download step, which can cause issues on shared Dataiku nodes where users do not have write access to shared locations on the server (see User Isolation Framework).

    To overcome this challenge, you can use spaCy's dedicated pip delivery mechanism.

  2. For instance, in your code environment's package requirements, add https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz.

    Dataiku screenshot of a Python code environment for NLP.

  3. After adding this link, rebuild your code environment.

  4. To verify that the model is installed correctly, run the following code in a notebook that uses this code environment:

    import spacy
    # Loading succeeds without error if the model package was installed correctly.
    nlp = spacy.load("en_core_web_sm")
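The steps above can be sketched in a slightly more defensive form. This is a minimal illustration, not part of Dataiku or spaCy: the helper names `model_requirement`, `is_installed`, and `smoke_test` are hypothetical, and the URL template is an assumption inferred from the example link in step 2. It builds the line to add to the package requirements and checks that both packages are importable before loading the model.

```python
import importlib.util

# Assumed URL pattern, inferred from the example link in step 2.
RELEASE_URL = (
    "https://github.com/explosion/spacy-models/releases/download/"
    "{name}-{version}/{name}-{version}.tar.gz"
)


def model_requirement(name: str, version: str) -> str:
    """Return the line to add to the code environment's package requirements."""
    return RELEASE_URL.format(name=name, version=version)


def is_installed(package: str) -> bool:
    """Report whether a top-level package (spaCy or a model) is importable."""
    return importlib.util.find_spec(package) is not None


def smoke_test(model: str = "en_core_web_sm",
               text: str = "Dataiku runs spaCy models.") -> list:
    """Load the model and tokenize a sample sentence.

    Returns an empty list when spaCy or the model is missing from the
    current code environment, instead of raising an error.
    """
    if not (is_installed("spacy") and is_installed(model)):
        return []
    import spacy  # imported lazily so the helpers work without spaCy

    nlp = spacy.load(model)
    return [token.text for token in nlp(text)]


if __name__ == "__main__":
    print(model_requirement("en_core_web_sm", "2.2.5"))
    print(smoke_test())
```

If `smoke_test()` returns an empty list in a notebook, the likely causes are that the notebook is not using the intended code environment or that the environment was not rebuilt after the link was added. Note also that a model release is generally tied to a range of spaCy versions, so the version in the link should match the spaCy version installed in the code environment.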