Product Pillar: AutoML

The AutoML pillar seeks to make the design of machine learning models easier, faster, and more transparent.

../../../_images/02_GRAPHIC_PILLARS_automl.png

With DSS, data scientists and advanced analysts can quickly create machine learning models through a process that builds confidence and trust at every stage of the AI lifecycle. These AI builders will find tools at their disposal to generate new features or better understand model results.

The vision for AutoML, however, extends beyond the data scientist persona.

  • Data engineers will find a scalable, reliable, and governed ML lifecycle, from feature engineering through monitoring.

  • Business analysts and beneficiaries of AI systems will find up-to-date, explained predictions to guide business decisions.

The following lessons briefly illustrate the implementation of this high-level vision for AutoML:

  • The flexible modeling strategies that DSS supports

  • The feature preprocessing steps

  • Native tools for model evaluation

Flexible Strategy

When building machine learning models, DSS allows enterprises to employ a flexible strategy best suited to their needs. At each stage of the machine learning lifecycle, DSS provides a diverse set of visual and coding options to help personas across the enterprise meet their objectives. This allows users to spend less time worrying about the tool and more time focused on analyzing model results.

Many DSS users find value in a simple visual interface that guides the creation of a machine learning task. Whether for a prediction or a clustering task, users can choose between automated modes, which select settings optimized for quick prototypes, interpretability, or performance, and an expert mode. Even with the visual interface, users retain full control over all aspects of a model’s design, such as which algorithms to test or which metrics to optimize.

DSS’s visual machine learning comes with support for different training engines, including:

  • In-memory Python (Scikit-learn / XGBoost)

  • MLlib (Spark) engine

  • H2O (Sparkling Water) engine

../../../_images/intro-ml-strategy-ui.png

From our final prepared dataset, we have opened a Lab to initiate an AutoML task to predict the “fare_amount” variable. Based on our goals, we can choose between several different styles of automated ML. We can also specify the engine for training, depending on whether Spark is enabled and the storage type of the dataset.

Other DSS users may prefer to use the Python API client to manage the model lifecycle entirely through code, without ever touching the visual interface.

../../../_images/intro-ml-strategy-api.png

Coders on a team can programmatically create and manage machine learning models in DSS without using any of the visual tools.
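For instance, a minimal sketch of this workflow with the dataikuapi package might look as follows; the host URL, API key, project key, and dataset name are all placeholders:

    import dataikuapi

    # Connect to the DSS instance (host URL and API key are placeholders)
    client = dataikuapi.DSSClient("http://localhost:11200", "YOUR_API_KEY")
    project = client.get_project("TAXI_FARES")  # hypothetical project key

    # Create a prediction ML task on a prepared dataset
    mltask = project.create_prediction_ml_task(
        input_dataset="trips_prepared",   # hypothetical dataset name
        target_variable="fare_amount",
        ml_backend_type="PY_MEMORY",      # in-memory Python engine
        guess_policy="DEFAULT"            # let DSS guess initial settings
    )
    mltask.wait_guess_complete()

    # Train all enabled algorithms and inspect the session's results
    mltask.start_train()
    mltask.wait_train_complete()
    for model_id in mltask.get_trained_models_ids():
        details = mltask.get_trained_model_details(model_id)
        print(model_id, details.get_performance_metrics())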

Moreover, the choice between clicking or coding through a machine learning task never needs to be entirely one or the other. Models created through the API can be adjusted in the visual interface like any other model. Models created through the visual interface can be monitored through the API and vice versa.

Deep learning with DSS is one excellent example of mixing code and visual tools. DSS provides a dedicated coding environment for defining a network architecture with Keras code. However, once a deep learning model is built, the model can be evaluated, and even deployed, like any other visual model in DSS.

../../../_images/intro-deep-learning.png

Here we can define a deep learning model architecture with the Keras library and deploy the resulting model with the visual tools in DSS.
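As a rough illustration (not the exact scaffolding DSS generates), the architecture for a fare regression could be defined with a few lines of Keras; the input dimension here is a placeholder:

    from tensorflow import keras
    from tensorflow.keras import layers

    # A simple feed-forward regression network (input size is a placeholder)
    model = keras.Sequential([
        keras.Input(shape=(10,)),
        layers.Dense(64, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(1),  # single output for the predicted fare
    ])
    model.compile(optimizer="adam", loss="mean_squared_error")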

Feature Preprocessing

A model in DSS is not just the ML algorithm itself, but an entire pipeline of activities, from preprocessing to evaluation. This pipeline can begin with the script of a visual analysis, where all of the tools in the processor library are available to prepare datasets for modeling.

When the input data is ready, users can view and adjust how each feature will be handled before it is fed to the algorithm. This includes strategies for imputing missing values or dummy-encoding categorical variables.

../../../_images/intro-features-handling.png

In our model to predict NY taxi fares, we can control how each existing feature is handled before it reaches the algorithm. Here we adjust rescaling options and imputation strategies for the feature “driving_time_in_traffic”. We can also click to apply common techniques for increasing or reducing the size of the feature space.
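These settings correspond to standard preprocessing steps. A minimal scikit-learn equivalent might look like the following; apart from “driving_time_in_traffic”, the column names are hypothetical:

    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # Impute and rescale numeric features; dummy-encode categorical ones
    preprocess = ColumnTransformer([
        ("numeric", Pipeline([
            ("impute", SimpleImputer(strategy="mean")),
            ("rescale", StandardScaler()),
        ]), ["driving_time_in_traffic", "trip_distance"]),  # second column hypothetical
        ("categorical", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("encode", OneHotEncoder(handle_unknown="ignore")),
        ]), ["vendor_id"]),  # hypothetical categorical column
    ])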

Users also have a visual interface for generating new features or reducing the size of the feature space. They can click to generate potentially thousands of new features by enabling pairwise linear or polynomial combinations, or explicit pairwise interactions. Those seeking to reduce the dimensions of the feature space can instead click to apply a reduction method, such as Principal Component Analysis.
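For intuition, here is what these two operations look like when expressed directly in scikit-learn on a stand-in feature matrix:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import PolynomialFeatures

    X = np.random.rand(100, 5)  # stand-in for a numeric feature matrix

    # Grow the feature space with pairwise (degree-2) combinations
    expanded = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
    print(expanded.shape)  # (100, 20): original, squared, and interaction terms

    # Or shrink it with Principal Component Analysis
    reduced = PCA(n_components=3).fit_transform(X)
    print(reduced.shape)  # (100, 3)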

Model Evaluation

Users can set the evaluation framework DSS uses to compare models:

  • Across various machine learning algorithms (whether defined through code or the visual UI)

  • Across various hyperparameter values for each algorithm

These options create a grid of potential models to train and evaluate. Users can set a maximum search time or a maximum number of iterations for any grid search.

../../../_images/intro-grid-search.png

In an initial prototype for predicting taxi fares, here we have trained a random forest model on just two values of one hyperparameter: the maximum depth of a tree. The observed variability in R-squared when moving from a maximum depth of 6 to 15 suggests we might want to test a wider range.
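Expressed directly in scikit-learn (on synthetic stand-in data), the same two-value search over maximum tree depth might look like:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV

    X = np.random.rand(200, 5)   # stand-in feature matrix
    y = np.random.rand(200)      # stand-in target (e.g. fare_amount)

    # Grid of two values for one hyperparameter, scored on R-squared
    search = GridSearchCV(
        RandomForestRegressor(n_estimators=100, random_state=0),
        param_grid={"max_depth": [6, 15]},
        scoring="r2",
        cv=3,
    )
    search.fit(X, y)
    print(search.best_params_, search.best_score_)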

DSS provides tools for quickly evaluating and comparing performance across multiple models. Users can drill down into a particular model to examine a full report of metrics and visualizations geared towards understanding both a model’s performance (such as a plot of error distribution) and its interpretability (such as a plot of variable importance).

DSS presents the most relevant summaries based on the type of machine learning task and the algorithms being tested. For example, the model report for a classification task will include a confusion matrix and ROC curve, while the model report for a clustering task will include a heatmap and cluster profiles.
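These report views summarize standard quantities. For a classification task, for example, the underlying confusion matrix and ROC AUC could be computed by hand with scikit-learn on synthetic data:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix, roc_auc_score
    from sklearn.model_selection import train_test_split

    # Synthetic binary classification data as a stand-in
    X, y = make_classification(n_samples=500, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # The same summaries a DSS model report visualizes for classification
    print(confusion_matrix(y_test, clf.predict(X_test)))
    print(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))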

../../../_images/intro-model-eval1.gif

Our top-performing model to predict taxi fares was an imported LightGBM model. The model report gives a full breakdown of its performance, along with plots of variable importance, partial dependence, and subpopulation analysis.