Tutorial | MLlib with Dataiku


Apache Spark comes with a built-in module called MLlib, which is designed for creating and training machine learning models at scale.

Dataiku makes it easy to use MLlib without coding, by offering it as an optional backend engine for creating models directly from within its interface.


This tutorial assumes that you have access to a Dataiku instance with Spark enabled, and a working installation of Spark, version 1.4 or later. The more recent the version, the better, as the MLlib API is evolving quickly.

This tutorial uses the well-known Titanic dataset, available for instance from the corresponding Kaggle competition. Start by downloading the files, then create two datasets from them: train and test.
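Before creating the datasets, it can help to confirm that the two files have the expected layout: in the Kaggle files, the training file contains the `Survived` target column and the test file does not. The sketch below shows one way to check this with Python's standard library. The helper names are hypothetical, and the in-memory samples stand in for the real files, which are assumed here to be named `train.csv` and `test.csv`.

```python
import csv
import io


def read_header(csv_file):
    """Return the column names found on the first row of an open CSV file."""
    return next(csv.reader(csv_file))


def check_titanic_headers(train_header, test_header, target="Survived"):
    """Check that the target is only in the training file and return the shared columns."""
    if target not in train_header:
        raise ValueError(f"training file has no {target!r} column")
    if target in test_header:
        raise ValueError(f"test file unexpectedly contains {target!r}")
    features = [name for name in train_header if name != target]
    if features != test_header:
        raise ValueError("train and test files do not share the same columns")
    return features


# In-memory stand-ins for the header rows of the two Kaggle files.
train_sample = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
)
test_sample = io.StringIO(
    "PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
)

features = check_titanic_headers(read_header(train_sample), read_header(test_sample))
print(features)
```

To check the downloaded files instead, pass file objects opened with `open("train.csv", newline="")` and `open("test.csv", newline="")` to `read_header`.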

Training an MLlib model

  1. Double-click on your train dataset, and create a new Analysis using the green button at the top right.

  2. From the Survived column header, click on Create Prediction model….

    Creating a new prediction model from within a Dataiku analysis.
  3. This is the important part. In the new modal window, in the ML Backend dropdown menu, select a Spark configuration (we’ll use the default here).

    Choosing Spark as the machine learning backend.
  4. Create the model. You are taken to a screen telling you when the model is ready to be trained. Do not train the model yet; instead, click on Settings.

    Where to find the Settings of a new prediction model.
  5. Under the Algorithms section, activate Random Forests.

    Choosing the algorithms to run in a prediction model.
  6. Click on Train, and wait for your task to complete. Once done, the summary results screen appears.

    Summary results screen for a prediction model.

Your models are now trained. They are ready to be deployed, so that scoring new records can be automated.
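For comparison, the sketch below shows roughly what a hand-written MLlib random forest could look like in PySpark. It is not Dataiku's implementation, only an illustration of the kind of code the interface saves you from writing. It assumes the DataFrame-based `pyspark.ml` API of Spark 2.0 or later, and the feature selection, the dropping of incomplete rows, the number of trees, and the function name are all illustrative choices rather than Dataiku defaults.

```python
LABEL_COLUMN = "Survived"
CATEGORICAL_COLUMN = "Sex"
NUMERIC_FEATURES = ["Pclass", "SibSp", "Parch", "Fare"]


def train_random_forest(spark, train_path, num_trees=50):
    """Fit a random forest pipeline on the Titanic training file.

    `spark` is an existing SparkSession and `train_path` points to the CSV file.
    """
    # Imported here so this module can be loaded where PySpark is not installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.feature import StringIndexer, VectorAssembler

    used_columns = [LABEL_COLUMN, CATEGORICAL_COLUMN] + NUMERIC_FEATURES
    train_df = (
        spark.read.csv(train_path, header=True, inferSchema=True)
        .select(*used_columns)
        .na.drop()  # illustrative choice: discard rows with missing values
    )

    # Encode the text column as a numeric index, then gather all features
    # into the single vector column that MLlib estimators expect.
    indexer = StringIndexer(inputCol=CATEGORICAL_COLUMN, outputCol="SexIndex")
    assembler = VectorAssembler(
        inputCols=NUMERIC_FEATURES + ["SexIndex"], outputCol="features"
    )
    forest = RandomForestClassifier(
        labelCol=LABEL_COLUMN, featuresCol="features", numTrees=num_trees, seed=42
    )
    return Pipeline(stages=[indexer, assembler, forest]).fit(train_df)
```

With a running session, for example one obtained from `SparkSession.builder.getOrCreate()`, calling `train_random_forest(spark, "train.csv")` returns a fitted `PipelineModel`.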

Using an MLlib model

Of the trained models, the Random Forest offers the best performance.

  1. Click on it, and from the top right, select Deploy.

    Deploying a prediction model.

    You are taken to the Flow screen.

  2. From the last green Prediction icon, select Apply and create a scoring recipe that will be used to score the test set.

    Creating a Scoring recipe from a deployed model.

Your Flow is now complete:

Completed Flow with deployed prediction model built using MLlib algorithms.

You just need to build the output dataset of the scoring recipe to get the predictions!
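Outside Dataiku, the equivalent scoring step could be sketched as follows. This is only an illustration, not what the scoring recipe runs internally. It assumes that `model` is any fitted `pyspark.ml` `PipelineModel` that writes to the default `prediction` column, and that `feature_columns` lists the raw input columns the model reads; the function name and the choice to drop incomplete rows are hypothetical.

```python
OUTPUT_COLUMNS = ["PassengerId", "prediction"]


def score_test_set(spark, model, test_path, feature_columns):
    """Apply a fitted pipeline model to the test file and keep ids and predictions.

    `spark` is an existing SparkSession, `model` a fitted PipelineModel, and
    `feature_columns` the raw input columns the model reads.
    """
    test_df = spark.read.csv(test_path, header=True, inferSchema=True)
    # Illustrative choice: rows with missing feature values are dropped,
    # since a VectorAssembler stage rejects nulls by default.
    complete_rows = test_df.na.drop(subset=feature_columns)
    scored = model.transform(complete_rows)
    return scored.select(*OUTPUT_COLUMNS)
```

The returned DataFrame has one row per scored passenger, which mirrors the output dataset that the scoring recipe produces in the Flow.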

MLlib can be used in Dataiku entirely from the interface, without having to write complex code. This opens up great opportunities, as more and more people will be able to leverage Spark to analyze data.