Hands-On Tutorial: Visual ML Enhancements

Dataiku DSS V10 comes with enhancements to the Visual ML interface. These include:

  • Additional feature handling methods,

  • Native support for the LightGBM algorithm, and

  • ML task queues.

New Feature Handling Methods

The visual ML interface of Dataiku DSS now includes additional encoding methods for categorical and date features. These give DSS more ways to represent such features numerically, which can help produce better-performing models.

For the categorical features, you can now select from additional options that include:

  • Frequency encoding, which replaces each category with its number of occurrences

  • Ordinal encoding, which assigns a unique integer value to each category, according to an order defined by count or lexicography

  • Target encoding, which replaces each category with a numerical value computed based on the target values.
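The three categorical encodings above can be illustrated with a minimal sketch in plain Python. This is not how DSS implements them: the function names, the sample data, the descending-count ordering, and the unsmoothed per-category mean used for target encoding are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def frequency_encode(values):
    """Replace each category with its number of occurrences."""
    counts = Counter(values)
    return [counts[v] for v in values]


def ordinal_encode(values, order="count"):
    """Map each category to a unique integer, ordered by count or lexicographically."""
    counts = Counter(values)
    if order == "count":
        # Assumption: most frequent category gets 0, ties broken alphabetically.
        ranked = sorted(counts, key=lambda c: (-counts[c], c))
    else:
        ranked = sorted(counts)
    mapping = {category: index for index, category in enumerate(ranked)}
    return [mapping[v] for v in values]


def target_encode(values, targets):
    """Replace each category with the mean target value observed for it.

    Naive version with no smoothing or cross-fitting (an assumption);
    production implementations usually add both to limit target leakage.
    """
    sums = defaultdict(float)
    counts = Counter()
    for value, target in zip(values, targets):
        sums[value] += target
        counts[value] += 1
    return [sums[v] / counts[v] for v in values]


# Hypothetical sample: a categorical column and a binary target.
states = ["CA", "NY", "CA", "TX", "CA", "NY"]
authorized = [1, 0, 1, 1, 0, 0]

print(frequency_encode(states))
print(ordinal_encode(states, order="count"))
print(target_encode(states, authorized))
```

In all three cases a single categorical column becomes a single numerical column, unlike dummy encoding, which creates one column per category.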

There is now an additional Cyclical datetime encoding option that transforms datetime features (timestamps) into numerical features while preserving the cyclical significance of the date and time periods. This new option is available to all numerical features.
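To make the idea of cyclical encoding concrete, here is a minimal sketch using the common sine/cosine technique. It is an illustration under assumptions, not DSS's implementation: the function names and the choice of periods (hour of day, day of week, month of year) are hypothetical.

```python
import math
from datetime import datetime


def cyclical_encode(value, period):
    """Map a position within a cycle to a (sin, cos) point on the unit circle."""
    angle = 2 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)


def encode_timestamp(ts):
    """Encode several periodic components of a timestamp (periods are assumptions)."""
    return {
        "hour": cyclical_encode(ts.hour + ts.minute / 60, 24),
        "weekday": cyclical_encode(ts.weekday(), 7),
        "month": cyclical_encode(ts.month - 1, 12),
    }


late = encode_timestamp(datetime(2021, 12, 31, 23, 0))
early = encode_timestamp(datetime(2022, 1, 1, 1, 0))
noon = encode_timestamp(datetime(2022, 1, 1, 12, 0))

# 23:00 and 01:00 are two hours apart on the clock, so their encodings are close,
# even though the raw hour values (23 and 1) are far apart.
print(math.dist(late["hour"], early["hour"]))
print(math.dist(late["hour"], noon["hour"]))
```

Each periodic component becomes two numerical columns. The pair is needed because a single sine or cosine value alone maps two different positions in the cycle to the same number.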

To learn more about these feature handling methods, visit the Features Handling page in the product documentation.

Native Support for LightGBM Algorithm

LightGBM is a modern gradient boosting algorithm, often seen as a successor to XGBoost due to its comparable performance at a fraction of the training time. By elevating LightGBM from its former plugin model implementation, Visual ML users can interact with it more efficiently and benefit from the full Visual ML experience (e.g., Bayesian hyperparameter search) in Dataiku DSS.

To learn more about this new capability of Dataiku DSS, visit the In-memory Python (Scikit-learn / LightGBM / XGBoost) page in the product documentation.

ML Task Queues

With the ML task queues feature, data scientists can decouple the feature engineering and model training steps by queueing training sessions. These training sessions can have different model designs; e.g., they can use different feature handling methods or different algorithms.

Queueing model training sessions eliminates the need to wait for a session’s training to finish before preparing the next model design experiment. Furthermore, these training queues can be scheduled for execution when resources are more abundant to minimize their impact on other teams.

In the hands-on lesson that follows, you’ll learn how to use these Visual ML enhancements.

Getting Started

You will need a Dataiku DSS project that contains a predictive model. To create the starter project:

  • From the Dataiku DSS homepage, click +New Project > DSS Tutorials > General Topics > Credit Card Fraud (Tutorial).

Note

You can also download the starter project from this website and import it as a zip file.

You should now be on the project’s homepage.

  • Go to the Flow and select Build all from the Flow Actions button in the bottom right of the Flow.

  • Go to the project’s Visual Analyses page from the top navigation bar.

  • Open the existing Prediction Model analysis.

Open prediction model analysis.

  • Click the Models tab to land on the “Result” page of the analysis.

Model result page with one session.

Here, you can see that there is a previously trained session with two models.

Add a LightGBM Model to an ML Task Queue

Let’s continue with this visual analysis by creating a new model training session that uses the LightGBM algorithm.

  • Click the Design tab.

  • Go to the Algorithms panel.

  • Click the slider next to LightGBM to select it.

  • Unselect the “Random Forest” and “Logistic Regression” algorithms.

Because we want to continue designing experimental models, we will wait to train the ML task that we just designed. Let’s add the task to a queue.

  • Click the drop-down arrow next to the Train button.

  • Click Add To Queue.

Add first ML task to queue.

  • Name the session LightGBM.

  • Click Add To Queue.

The Train button has changed to Train Queue, and a new button Add To Queue appears next to it.

  • Click the Result tab to switch to the Result page and notice that the training session for the current model design has been added to the queue.

Result page showing first ML task added to the queue.

Queue a Session That Uses Frequency Encoding of a Categorical Feature

Let’s design another model training session. This time, we’ll include the Random Forest and Logistic Regression algorithms as well as the LightGBM one.

  • In the previously trained “Session 1”, click the Revert Design to This Session icon (between the session name and the “Delete” icon).

Revert to first session's design.

  • Click Confirm to use the design specified for Session 1.

  • In the “Algorithms” panel, select the LightGBM algorithm in addition to the already selected “Random Forest” and “Logistic Regression” algorithms.

  • Go to the Features Handling panel.

  • Search for the “Merchant_state” feature and select it. This feature is currently “dummy encoded.”

  • Change the “Category handling” to use the Frequency encoding option.

  • Click the Add to Queue button to add the training session for this ML task to the Queue.

  • Name the session merchant state - frequency encoding.

  • Click Add to Queue.

Queue a Session That Uses Cyclical Encoding of a Numerical Feature

Similarly, we’ll design another model training session. This time, we’ll use all three algorithms (Random Forest, Logistic Regression, and LightGBM).

  • Go to the Result tab.

  • In the previously trained “Session 1”, click the icon Revert Design to This Session.

  • Click Confirm to use the design specified for Session 1.

  • In the “Algorithms” panel, select the LightGBM algorithm in addition to the already selected “Random Forest” and “Logistic Regression” algorithms.

  • Go to the Features Handling panel.

  • Search for the “purchase_date” feature and enable it.

Cyclical datetime encoding of the "purchase date" feature.

Dataiku recognizes that purchase_date is a date feature and selects the “cyclical datetime encoding” feature handling method.

Note

The Cyclical datetime encoding feature handling method is available to all numeric features.

  • Click the Add to Queue button to add the training session for this ML task to the Queue.

  • Name the session purchase date - datetime encoding.

  • Click Add to Queue.

You can see the list of queued sessions from the Design tab. To do so:

  • Click the drop-down arrow next to Train Queue.

See queued sessions from design tab.

Train the ML Task Queue

We’re now ready to train the ML task queue. We’ll show two ways to do this. First, we’ll train the sessions right within the visual ML interface. Then, we’ll train the sessions by running a macro.

Let’s begin by training the queue from the visual ML interface.

  • Go to the Result tab.

  • Click the Train Queue button.

The first session in the queue, the “LIGHTGBM” session, begins to train. When this session is done training, the next queued session begins training. Let’s interrupt the session training. To do this:

  • Click Abort in the Result tab.

Dataiku informs you that aborting the training session will start training the next session in the queue. However, let’s pause the queue, so that Dataiku doesn’t immediately start training the models in the next session.

  • Check the box next to “Pause queue before aborting.”

  • Click Confirm.

Pause then abort training of ML task queue.

Let’s now finish training the ML task queue by running a macro.

  • From the top navigation bar, go to the More Options (…) menu and click Macros.

  • In the section for “Builtin macros,” click Train paused ML task queues.

You have the option to run “all queues in the current project” or run a “single ML task queue”.

  • Select to run Single ML task queue.

  • Specify “Analysis”: Prediction Model and “ML Task”: Predict authorized_flag.

  • Click Run Macro.

Train ML task queue from macro.

  • Click the text Predict authorized_flag (Prediction Model) in the “Train paused ML task queues” pop-up window to open the Result tab of the Prediction Model analysis in a new window.

Go to ML task queue training result from macro.

Notice that the ML task queue resumes with the next queued session (“Purchase Date - Datetime Encoding”). The session whose training we aborted is not restarted.

  • Wait for the session to finish training and observe the results.

Train final ML task queue session.
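The pause, abort, and resume behavior observed in this lesson can be summarized with a small simulation. This is a toy model, not the DSS API: the Session and TaskQueue classes, the status names, and the abort_on parameter are hypothetical. It also assumes that the abort happened during the second queued session, which is consistent with the queue resuming at the purchase date session.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    name: str
    status: str = "queued"  # becomes "done" or "aborted"


@dataclass
class TaskQueue:
    sessions: list = field(default_factory=list)
    paused: bool = False

    def add(self, name):
        """Queue a new session design without training it."""
        self.sessions.append(Session(name))

    def train(self, abort_on=None):
        """Train queued sessions in order.

        If abort_on names a session, the queue is paused and that session
        is aborted when its turn comes (mimicking "Pause queue before aborting").
        """
        self.paused = False
        for session in self.sessions:
            if session.status != "queued":
                continue  # finished or aborted sessions are never re-run
            if session.name == abort_on:
                self.paused = True
                session.status = "aborted"
                return
            session.status = "done"  # stand-in for actual model training


queue = TaskQueue()
queue.add("LightGBM")
queue.add("merchant state - frequency encoding")
queue.add("purchase date - datetime encoding")

# Train from the interface, then pause and abort during the second session.
queue.train(abort_on="merchant state - frequency encoding")

# Resume the paused queue (the role played by the macro in this lesson).
queue.train()

for session in queue.sessions:
    print(session.name, session.status)
```

The key point of the sketch is that resuming only picks up sessions still marked as queued, so an aborted session has to be queued again if you still want to train it.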