Hands-On Tutorial: Custom Preprocessing in the Visual ML Tool

Dataiku DSS provides a powerful visual ML tool with built-in feature preprocessing. You can even extend this functionality by using code to implement custom feature preprocessing.

Let’s Get Started!

In this hands-on lesson, you will implement custom preprocessing on a text column of a dataset. In the process, you’ll learn the requirements for processors that you can use in the visual ML tool.

Prerequisites

To become familiar with AutoML, visit the Machine Learning Basics course.

You’ll need access to Dataiku version 8.0 or above (the free edition is enough). You can get started by downloading a free trial.

Create the Project

The first step is to create the project.

  • From the Dataiku homepage, click +New Project > DSS Tutorials > Developer > Custom Preprocessing in Visual ML (Tutorial).

Note

You can also download the starter project from this website and import it as a zip file.

Explore the Project

The starting Flow of the project consists of an ecommerce_reviews dataset and a vocabulary folder.

Dataiku screenshot of the starting flow for the custom preprocessing tutorial.

The ecommerce_reviews dataset consists of a text feature, Review Text, which contains customer reviews about women’s clothing items. There is also a Rating feature that indicates the final customer ratings on a scale of 1 to 5. The source of this dataset is Women’s E-Commerce Clothing Reviews.

The vocabulary folder consists of a text file, vocabulary.txt, which lists words that you will use to perform count vectorization on the Review Text feature.

Preprocess Features in the Visual ML Tool

To begin the analysis, you’ll use the visual ML tool to build a prediction model. This model will predict the Rating assigned to a particular Clothing ID using the other features in the ecommerce_reviews dataset.

Tip

This hands-on tutorial will focus only on the custom preprocessing of features in the dataset.

  • From the Flow, click the ecommerce_reviews dataset to select it.

  • Open the right panel and click LAB.

  • Click AutoML Prediction and select Rating as the target feature.

  • Click Quick Prototypes.

  • Click Create.

Dataiku screenshot of the dialog for creating a quick prototype ML task.

Dataiku has selected some algorithms that are ready to be trained in the visual ML tool. However, before training the selected algorithms, explore the feature preprocessing.

Explore the Built-in Preprocessing

The visual ML tool of Dataiku has several built-in feature preprocessing methods. To access the preprocessors,

  • Click the Design tab, and then click the Features Handling panel.

  • Select any feature from the list to see its built-in feature handling methods and some summary statistics.

Because the visual ML tool rejects text features by default, it has rejected the two text features, Title and Review Text, in the dataset. However, you will use the Review Text feature in your analysis.

  • Click the “On/Off” slider next to Review Text to enable this feature.

The default text handling method is “Tokenize, hash and apply SVD”. Next, you’ll apply a custom preprocessing method to handle this feature.

Dataiku screenshot of the Features handling panel of the visual ML tool highlighting a text feature.

Apply Custom Preprocessing

Instead of using the default selection for preprocessing the Review Text feature, you'll apply a different kind of preprocessing, count vectorization, to transform the text into numeric values.

Notice that Dataiku has a “Count vectorization” option that you can select. This option requires that you specify values for parameters such as “Stop words”, “Ngrams”, etc. However, in this tutorial, you'll take the coding approach, implementing count vectorization with the “Custom preprocessing” option.

  • Click the dropdown menu next to “Text handling” and select Custom preprocessing. Dataiku DSS displays a code editor with template Python code.

Dataiku screenshot of the template code given when choosing custom preprocessing for a text feature.

Next, you'll replace this template code with custom Python code that implements the CountVectorizer processor from scikit-learn, using the list of words in the vocabulary.txt file as an a-priori dictionary of terms for the processor.

Note

When writing custom preprocessing Python code, the processor must be scikit-learn compatible. That is, it needs to have the fit and transform methods.

  • The fit method must modify the object in place if a fitting is necessary.

  • The transform method must return a Pandas DataFrame, a 2-D NumPy array, or a SciPy compressed sparse row matrix (scipy.sparse.csr_matrix) containing the preprocessed result. If the transform method returns a NumPy array or a SciPy compressed sparse row matrix, then the processor should also have a names attribute containing the list of the output feature names.

  • You must assign the processor to a variable named processor. Dataiku DSS looks for this variable to apply the processor to the desired feature.

For more details, visit Custom Preprocessing.
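As a minimal sketch of these requirements, a custom processor might look like the class below. The class name and output feature are illustrative only, not part of this tutorial; they show the fit/transform contract and the names attribute in their simplest form.

```python
import numpy as np

class TextLengthProcessor:
    # Because transform returns a NumPy array, the processor exposes
    # the output feature names through this attribute
    names = ["text_length"]

    def fit(self, series):
        # Nothing to learn for this transform; when fitting is needed,
        # fit must modify the object in place
        pass

    def transform(self, series):
        # Return a 2-D NumPy array: one row per input value,
        # one column per output feature
        return np.array([[len(str(value))] for value in series])

# Dataiku DSS looks for a variable named 'processor'
processor = TextLengthProcessor()
```

Applied to a text column, this processor would produce a single numeric feature holding each review's character count.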

  • Delete the template code and paste the following Python code into the code editor. Then click Save.

from sklearn.feature_extraction.text import CountVectorizer
import dataiku
fold = dataiku.Folder("vocabulary")
print("folder found")

with fold.get_download_stream('vocabulary.txt') as f:
    voc = f.read().decode()
voc = voc.split('\n')

# The custom processor must be assigned to a variable named 'processor'
processor = CountVectorizer(
            min_df = 10,  # Tokens must appear in at least 10 documents
            max_df = 0.6, # Tokens that appear in more than 60% of documents are ignored
            ngram_range = (1,2),
            vocabulary = voc,
            # Here we override the token selection regexp to keep only words of five or more characters
            token_pattern = u'(?u)\\b\\w\\w\\w\\w\\w+\\b')

The preprocessing code does the following:

  1. Imports the CountVectorizer preprocessing method from one of the scikit-learn modules.

    Tip

    The default DSS built-in code environment includes the scikit-learn package. If you want to import a package that the default code environment does not contain, you can create a new code environment that includes the package and select it in the “Runtime environment” panel of the visual ML tool.

    You can also call classes that exist in the project library.

  2. Opens the vocabulary.txt file and creates a list of words from it.

  3. Instantiates the CountVectorizer, assigns the list of words to the “vocabulary” parameter, and stores the result in a variable named processor.
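To see how a vocabulary-constrained CountVectorizer behaves, you can try it outside of DSS on a couple of sample texts. The reviews and vocabulary below are made up for illustration; they are not taken from the tutorial dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up vocabulary and reviews for illustration
voc = ["dress", "fabric", "quality"]
reviews = [
    "Lovely dress, the fabric feels great",
    "Poor quality fabric but a pretty dress",
]

# With a fixed vocabulary, the output columns follow the vocabulary order
vectorizer = CountVectorizer(vocabulary=voc)
counts = vectorizer.fit_transform(reviews)  # SciPy sparse matrix

print(counts.toarray())
# [[1 1 0]
#  [1 1 1]]
```

Each row counts how often each vocabulary word appears in the corresponding review; words outside the vocabulary are ignored.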

Train the Selected Algorithms

Once you’ve written the code for custom preprocessing, you can proceed to train the selected algorithms in the visual ML tool.

  • Click the Algorithms panel on the left-hand side of the page to see that Dataiku has selected “Random Forest” and “Logistic Regression” for training.

  • Click the Train button in the top right corner of the page to train the models.

  • Name the session Count Vectorization Preprocessing, and click Train.

  • Once training completes, click the Random forest (Count Vectorization Preprocessing) model to open the Report page.

On the Report page, you can click the panels on the left-hand side to see model interpretations, performance charts, and model information.

Dataiku screenshot of the variable importance chart after training a model with custom preprocessing.

For example, the Variable importance chart shows many features that begin with the prefix Review Text:unnamed_. These features correspond to the words from the vocabulary.txt file used to perform count vectorization on the Review Text feature.

What’s Next?

Congratulations! You’ve completed the hands-on tutorial for Custom Preprocessing!

You saw how to perform custom preprocessing in the visual ML tool. You also learned that any custom processor you use must be scikit-learn compatible.

For more information, visit Writing custom models.

To continue learning and building this project, see the next lesson, Hands-On Tutorial: Custom Modeling in the Visual ML Tool, to implement some of the different ways of creating custom models on a dataset.