Tutorial | Custom preprocessing within the visual ML tool#

Dataiku provides a powerful visual ML tool with built-in feature preprocessing. You can even extend this functionality by using code to implement custom feature preprocessing.

Objectives#

In this tutorial, you will:

  • Implement custom preprocessing on a text column of a dataset.

  • Learn the requirements for processors that you can use in the visual ML tool.

Prerequisites#

To become familiar with AutoML, visit the Machine Learning Basics course.

You’ll need access to Dataiku version 8.0 or above (the free edition is enough). You can get started by downloading a free trial.

Create the project#

The first step is to create the project.

  1. From the Dataiku homepage, click +New Project > DSS tutorials > Developer > Custom Preprocessing in Visual ML.

Note

You can also download the starter project from this website and import it as a zip file.

Explore the project#

The starting Flow of the project consists of an ecommerce_reviews dataset and a vocabulary folder.

Dataiku screenshot of the starting flow for the custom preprocessing tutorial.

The ecommerce_reviews dataset consists of a text feature, Review Text, which contains customer reviews about women’s clothing items. There is also a Rating feature that indicates the final customer ratings on a scale of 1 to 5. The source of this dataset is Women’s E-Commerce Clothing Reviews.

The vocabulary folder consists of a text file, vocabulary.txt, which lists words that you will use to perform count vectorization on the Review Text feature.

Preprocess features in the visual ML tool#

To begin the analysis, you’ll use the visual ML tool to build a prediction model. This model will predict the Rating assigned to a particular Clothing ID using the other features in the ecommerce_reviews dataset.

Tip

This tutorial will focus only on the custom preprocessing of features in the dataset.

  1. From the Flow, click the ecommerce_reviews dataset to select it.

  2. From the right panel, in the Actions tab, click LAB.

  3. Click AutoML Prediction and select Rating as the target feature.

  4. Select Quick Prototypes and click on Create.

Dataiku screenshot of the dialog for creating a quick prototype ML task.

Dataiku has selected some algorithms that are ready to be trained in the visual ML tool. However, before training the selected algorithms, explore the feature preprocessing.

Explore the built-in preprocessing#

Dataiku’s visual ML tool has several built-in feature preprocessing methods. To access them:

  1. Click the Design tab, and then click the Features Handling panel.

  2. Select any feature from the list to see its built-in feature handling methods and some summary statistics.

    Note

    Because the visual ML tool rejects text features by default, it has rejected the two text features, Title and Review Text, in the dataset. However, you will use the Review Text feature in your analysis.

  3. Click the On/Off slider next to Review Text to enable this feature.

The default text handling method is Tokenize, hash and apply SVD. Next, you’ll apply a custom preprocessing method to handle this feature.

Dataiku screenshot of the Features handling panel of the visual ML tool highlighting a text feature.

Apply custom preprocessing#

Instead of using the default preprocessing for the Review Text feature, you can transform the text into numeric values with a different method: count vectorization. The built-in Count vectorization option requires you to specify values for parameters such as Stop words and Ngrams. In this tutorial, however, you’ll implement count vectorization in code by using the Custom preprocessing option.

  1. Click the dropdown menu next to Text handling.

  2. Select Custom preprocessing. Dataiku displays a code editor with template Python code.

Dataiku screenshot of the template code given when choosing custom preprocessing for a text feature.

Next, you’ll replace this template with custom Python code that uses the CountVectorizer processor from scikit-learn, with the list of words in the vocabulary.txt file as an a priori dictionary of terms for the processor.

Note

When writing custom preprocessing Python code, the processor must be scikit-learn compatible. That is, it needs to have the fit and transform methods.

  • The fit method must modify the object in place if fitting is necessary.

  • The transform method must return a Pandas DataFrame, a 2-D NumPy array, or a SciPy compressed sparse row matrix (scipy.sparse.csr_matrix) containing the preprocessed result. If the transform method returns a NumPy array or a SciPy compressed sparse row matrix, then the processor should also have a names attribute containing the list of the output feature names.

  • You must assign the processor to a variable named processor. Dataiku looks for this variable to apply the processor to the desired feature.

For more details, visit Custom Preprocessing.
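The requirements above can be illustrated with a minimal sketch of a hand-written processor. The class name TextLengthProcessor and its two output columns are hypothetical and chosen only for illustration. The sketch also assumes that the feature arrives as an iterable of values (for example, a pandas Series) and that missing values should count as empty text.

```python
import numpy as np


class TextLengthProcessor:
    """Hypothetical processor: turns each text value into two numeric columns."""

    def __init__(self):
        # Output feature names; needed because transform returns a NumPy array.
        self.names = ["char_count", "word_count"]

    def fit(self, series):
        # Nothing to learn for this processor; the method only has to exist.
        return self

    def transform(self, series):
        # Assumption: missing or non-string values are treated as empty text.
        texts = [value if isinstance(value, str) else "" for value in series]
        rows = [[len(text), len(text.split())] for text in texts]
        # Force a 2-D shape so that an empty input still has two columns.
        return np.array(rows, dtype=float).reshape(len(texts), 2)


# Dataiku looks for a variable named `processor`.
processor = TextLengthProcessor()
```

Because transform returns a 2-D NumPy array, the names attribute supplies one name per output column.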

  1. Delete the template code and paste the following Python code into the code editor.

  2. Click Save.

from sklearn.feature_extraction.text import CountVectorizer
import dataiku
fold = dataiku.Folder("vocabulary")
print("folder found")

# Read the vocabulary file from the managed folder
with fold.get_download_stream('vocabulary.txt') as f:
   voc = f.read().decode()
# One term per line; drop blank lines and surrounding whitespace
voc = [term.strip() for term in voc.split('\n') if term.strip()]

# Custom preprocessing code must define the 'processor' variable
processor = CountVectorizer(
            min_df = 3, # Tokens must appear in at least 3 documents
            max_df = 0.8, # Tokens that appear in more than 80% of documents are ignored
            # Note: scikit-learn ignores min_df and max_df when a vocabulary is provided
            ngram_range = (1,2),
            vocabulary = voc,
            # Here we override the token selection regexp to keep tokens of 5+ word characters
            token_pattern = u'(?u)\\b\\w\\w\\w\\w\\w+\\b')

The preprocessing code does the following:

  1. Imports the CountVectorizer preprocessing method from one of the scikit-learn modules.

    Tip

    The default Dataiku built-in code environment includes the scikit-learn package. If you want to import a package that the default code environment does not contain, first create a new code environment that includes this package, and then select it in the Runtime environment panel of the visual ML tool.

    You can also call classes that exist in the project library.

  2. Opens the vocabulary.txt file and creates a list of words from it.

  3. Instantiates the CountVectorizer, assigns the list of words to the “vocabulary” parameter, and stores the processor in a processor variable.
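CountVectorizer raises a ValueError when a vocabulary list contains a repeated term, so it can help to clean the file contents before passing them on. The following is a minimal sketch of such a cleaning step. The helper name clean_vocabulary is hypothetical, and lowercasing the terms is an assumption based on CountVectorizer lowercasing documents by default.

```python
def clean_vocabulary(raw_text):
    """Return unique, non-empty terms from newline-separated text, in file order."""
    seen = set()
    terms = []
    for line in raw_text.splitlines():
        # Assumption: terms are lowercased to match CountVectorizer's default
        # lowercase=True, which lowercases documents before tokenizing.
        term = line.strip().lower()
        if term and term not in seen:
            seen.add(term)
            terms.append(term)
    return terms


# Small in-memory example standing in for the decoded file contents.
sample = "dress\n\nDress \r\nfabric\n"
cleaned = clean_vocabulary(sample)
```

In the preprocessing code, the same helper could be applied to the decoded contents of vocabulary.txt, with the result passed as the vocabulary parameter.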

Train the selected algorithms#

Once you’ve written the code for custom preprocessing, you can proceed to train the selected algorithms in the visual ML tool.

  1. Click the Algorithms panel on the left-hand side of the page to see that Dataiku has selected Random Forest and Logistic Regression for training.

  2. Click the Train button in the top right corner of the page to train the models.

  3. Name the session Count Vectorization Preprocessing, and click Train.

  4. Once training completes, click the Random forest (Count Vectorization Preprocessing) model to open the Report page.

On the Report page, you can click the panels on the left-hand side to see model interpretations, performance charts, and model information.

For example, the Gini method for feature importance displays many features that include Review Text:unnamed_ in the name. These features correspond to the words from the vocabulary.txt file used to perform count vectorization on the Review Text feature.

Dataiku screenshot of the variable importance chart after training a model with custom preprocessing.
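The unnamed_ labels appear because CountVectorizer returns a sparse matrix and has no names attribute. One possible way to expose the vocabulary terms as feature names is a thin wrapper like the sketch below. The class NamedCountVectorizer is hypothetical and not part of Dataiku or scikit-learn, and the sketch assumes that Dataiku reads the names attribute after fitting, as described in the requirements for custom processors.

```python
from sklearn.feature_extraction.text import CountVectorizer


class NamedCountVectorizer:
    """Hypothetical wrapper that adds a `names` attribute to CountVectorizer."""

    def __init__(self, vocabulary, **kwargs):
        # Extra keyword arguments are passed through to CountVectorizer.
        self.vectorizer = CountVectorizer(vocabulary=vocabulary, **kwargs)
        self.names = []

    def fit(self, series):
        self.vectorizer.fit(series)
        # vocabulary_ maps each term to its column index; sort by that index
        # so the names line up with the output columns.
        ordered = sorted(self.vectorizer.vocabulary_.items(), key=lambda item: item[1])
        self.names = [term for term, _ in ordered]
        return self

    def transform(self, series):
        # Returns a SciPy sparse matrix with one column per vocabulary term.
        return self.vectorizer.transform(series)


# Toy data standing in for the Review Text feature and vocabulary.txt.
reviews = ["Perfect fabric, perfect fit"]
demo = NamedCountVectorizer(vocabulary=["comfortable", "fabric", "perfect"])
matrix = demo.fit(reviews).transform(reviews)
```

To try this in the tutorial, you would assign an instance built from the vocabulary list to the processor variable and retrain, then check whether the feature importance chart shows the terms in place of unnamed_.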

Tip

Explore and compare the results of Shapley and Gini methods for a more informed view of feature importance.

What’s next?#

Congratulations! You’ve completed the tutorial for custom preprocessing!

You saw how to perform custom preprocessing in the visual ML tool. You also learned that any custom processor you use must be scikit-learn compatible. For more information, visit Writing custom models.

To continue building this project, visit Tutorial | Custom modeling within the visual ML tool to implement some of the different ways of creating custom models on a dataset.