Tutorial | Custom modeling within the visual ML tool

The visual ML tool in Dataiku comes with built-in models. You can even extend this functionality by creating your own custom models.

Objectives

In this tutorial, you will build custom models on a dataset. In the process, you’ll learn the requirements for custom models used in the visual ML tool and implement some of the different ways to create custom models.

Prerequisites

To become familiar with Visual ML, visit Machine Learning Basics.

You’ll need access to Dataiku version 8.0 or above (the free edition is enough). You can get started by downloading a free trial.

Create the project

If you previously completed Tutorial | Custom preprocessing within the visual ML tool, you can continue working with the project from that lesson.

Alternatively, you can create a new project with these steps already completed.

  1. From the Dataiku homepage, click +New Project > DSS tutorials > Developer > Custom Modeling in Visual ML.

Note

You can also download the starter project from this website and import it as a zip file.

Explore the project

The starting Flow of the project consists of an ecommerce_reviews dataset and a vocabulary folder.

Dataiku screenshot of the starting flow for the custom preprocessing tutorial.

The ecommerce_reviews dataset consists of a text feature Review Text which contains customer reviews about women’s clothing items. There is also a Rating feature that indicates the final customer ratings on a scale of 1 to 5. Dataset source: Women’s E-Commerce Clothing Reviews.

The vocabulary folder consists of a text file vocabulary.txt with a list of words.

The project also contains a visual analysis Quick modeling of Rating on ecommerce_reviews that performs:

  • Custom preprocessing of the Review Text feature.

  • Training of a Random Forest classifier and a Logistic Regression classifier, using Rating as the target.

Specify custom models in the visual ML tool

We will begin by going to the visual analysis Quick modeling of Rating on ecommerce_reviews.

  1. From the Flow, click the Visual Analysis icon in the top navigation bar.

  2. Click Quick modeling of Rating on ecommerce_reviews to open the visual analysis.

  3. Click Models at the top of the visual analysis page to open the model Result page.

  4. Click Design to go to the model design page.

  5. On the Design page, click the Algorithms panel.

    Dataiku screenshot of the Algorithms panel of the visual ML tool.

  6. The list of algorithms begins with the built-in models. Click +Add Custom Python Model at the bottom of the list.

A Python code editor opens with a code template to get you started.

Note

The code in the editor must follow some constraints depending on the backend you’ve chosen (in-memory or MLlib). In this example, we’re using the Python in-memory backend, therefore:

  • The algorithm must be scikit-learn compatible, that is, it needs to have the fit and predict methods.

  • In addition to these methods, classifiers must expose a classes_ attribute and can optionally implement a predict_proba method.

The code template lists some additional constraints when creating the custom model.
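To make these constraints concrete, here is a minimal sketch of a scikit-learn compatible classifier. The class name MajorityClassClassifier and its priors_ attribute are hypothetical and chosen only for illustration; the sketch shows the fit, predict, classes_, and predict_proba pieces described in the note. In practice, a class like this would more likely be defined in a project library file and imported into the editor.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class MajorityClassClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical toy classifier: always predicts the most frequent class."""

    def fit(self, X, y):
        # classes_ holds the sorted unique labels seen during training.
        self.classes_, counts = np.unique(y, return_counts=True)
        # Store the empirical class frequencies for predict_proba.
        self.priors_ = counts / counts.sum()
        return self

    def predict_proba(self, X):
        # One row per sample, one column per class (same order as classes_).
        n_samples = np.asarray(X).shape[0]
        return np.tile(self.priors_, (n_samples, 1))

    def predict(self, X):
        # Pick the class with the highest probability for each row.
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]


# As in the tutorial's snippets, the estimator is assigned to a variable named clf.
clf = MajorityClassClassifier()
```

This toy model ignores the features entirely, so it is only useful for checking that the required interface is in place.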

Dataiku screenshot of the Algorithms panel of the visual ML tool showing a custom Python model.

Import an algorithm from scikit-learn

Let’s import a Multi-layer Perceptron classifier from one of the scikit-learn modules. The default code environment (DSS built-in environment) used by the visual ML tool includes scikit-learn, so we don’t need to create a new code environment for this.

Note

If you want to import algorithms from different modules (or packages), you first need to create a code environment that includes this module and set the Runtime environment of the visual ML tool to this new code environment.

  1. Delete the template code, and paste the following Python code into the code editor to instantiate the MLP classifier.

    from sklearn.neural_network import MLPClassifier
    # Assign the estimator to clf so the visual ML tool can pick it up.
    clf = MLPClassifier(random_state=1, max_iter=300)
    
  2. Click the pencil icon next to the custom model’s name to rename it from Custom Python model to MLPClassifier.

  3. Click Save in the top right-hand corner.

Dataiku screenshot of the Algorithms panel of the visual ML tool showing a custom scikit-learn model.
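If you want to check outside Dataiku that this estimator satisfies the constraints listed in the earlier note, a quick sanity check might look like the sketch below. The synthetic data from make_classification is an assumption standing in for the preprocessed review features, and the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (an assumption, not the tutorial's dataset).
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=1
)

clf = MLPClassifier(random_state=1, max_iter=300)
clf.fit(X, y)

# Check the scikit-learn compatible interface: fit, predict, classes_, predict_proba.
has_required_methods = all(hasattr(clf, name) for name in ("fit", "predict"))
has_classes = hasattr(clf, "classes_")
proba = clf.predict_proba(X[:5])  # one probability column per class

print(has_required_methods, has_classes, proba.shape)
```

With max_iter=300, scikit-learn may emit a convergence warning on some data; the check itself does not depend on convergence.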

Import an algorithm from the project library

Here, we’ll import an algorithm that we’ve defined in the Project Library.

  1. In the top navigation bar, click Libraries in the Code dropdown menu (</>).

  2. Click the dropdown arrow next to the python folder to see the custom_models.py file.

Dataiku screenshot of the project libraries page showing a Python file for a custom model.

The file contains the definition for an AdaBoostModel classifier. Notice that this classifier is scikit-learn compatible. We will import and use this classifier to create another custom model.
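The contents of custom_models.py are not reproduced in this tutorial. As a purely hypothetical sketch of what a scikit-learn compatible AdaBoostModel could look like, the class below wraps scikit-learn's AdaBoostClassifier; the constructor parameters and the model_ attribute are assumptions, and the actual file in the project may differ.

```python
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.ensemble import AdaBoostClassifier


class AdaBoostModel(ClassifierMixin, BaseEstimator):
    """Hypothetical stand-in for the class defined in custom_models.py."""

    def __init__(self, n_estimators=50, learning_rate=1.0, random_state=0):
        # Store constructor arguments unchanged so get_params/set_params work.
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.random_state = random_state

    def fit(self, X, y):
        # Delegate training to scikit-learn's AdaBoost implementation.
        self.model_ = AdaBoostClassifier(
            n_estimators=self.n_estimators,
            learning_rate=self.learning_rate,
            random_state=self.random_state,
        )
        self.model_.fit(X, y)
        # Expose the learned labels, as required for classifiers.
        self.classes_ = self.model_.classes_
        return self

    def predict(self, X):
        return self.model_.predict(X)

    def predict_proba(self, X):
        return self.model_.predict_proba(X)
```

Wrapping the estimator rather than using it directly leaves room to add custom logic in fit or predict while keeping the interface the visual ML tool expects.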

  1. Return to the Design page of the visual ML tool (you can do this quickly by clicking the back arrow in your browser window).

  2. Click +Add Custom Python Model at the bottom of the list.

  3. Rename the model to AdaBoostModel.

  4. Replace the code in the editor with:

    from custom_models import AdaBoostModel
    
    clf = AdaBoostModel()
    
    Dataiku screenshot of the Algorithms panel of the visual ML tool showing the inclusion of custom models.
  5. Click Train to train the models.

  6. Name the session Custom models and click Train.

View session output with custom models

During training, the Result tab displays a graph of the evolution of the ROC AUC metric during grid search. Grid search isn’t available for custom models, but they are still listed alongside the other models built during the session.

Dataiku screenshot of session results including built-in and custom models.

Assess performance of the custom models

Now we’ll open one of the custom models to visualize its performance and all associated visual insights, just as we would do with a built-in model.

  1. Click the MLPClassifier (Custom models) model to open its Report page.

  2. Under Performance, click ROC curve to view the performance metric.

Dataiku screenshot of an ROC curve chart assessing performance of built-in and custom models.

You can explore the custom model’s training details, such as individual explanations, the confusion matrix, the calibration curve, and the ROC curve, and view metrics such as the F1 score. Dataiku can produce these metrics and visualizations because the custom model is scikit-learn compatible!
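To see why scikit-learn compatibility is enough for this kind of reporting, the sketch below computes a confusion matrix, a macro F1 score, and a one-vs-rest ROC AUC using only predict, predict_proba, and classes_. It is an illustration with synthetic data, not a description of how Dataiku computes its reports internally; the dataset, split, and averaging choices are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (an assumption, not the tutorial's dataset).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

clf = MLPClassifier(random_state=1, max_iter=300).fit(X_train, y_train)

# Everything below relies only on the scikit-learn compatible interface.
y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)

cm = confusion_matrix(y_test, y_pred, labels=clf.classes_)
f1 = f1_score(y_test, y_pred, average="macro")
auc = roc_auc_score(y_test, y_proba, multi_class="ovr", labels=clf.classes_)

print(cm)
print(f"macro F1: {f1:.3f}, one-vs-rest ROC AUC: {auc:.3f}")
```

A model without predict_proba could still yield the confusion matrix and F1 score here, but not the probability-based ROC AUC.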

The custom model can now be deployed in the Flow and used just like a standard built-in model!

What’s next?

Congratulations! You’ve completed the tutorial for custom modeling!

You learned to:

  • Create custom models that are scikit-learn compatible.

  • Use these custom models in the visual ML tool.

  • Import custom models from packages such as scikit-learn and from the Project library.

  • View training details of a custom model in the visual ML tool.

To learn more about custom models, in particular how to implement your own MLlib models in Scala while still using the visual ML tool, visit Custom Models.