Tutorial | MLlib with Dataiku
This content is also included in a free Dataiku Academy course Using MLlib. Register for the course there if you’d like to track and validate your progress alongside concept videos, text summaries, hands-on tutorials, and quizzes.
Apache Spark comes with a built-in module called MLlib, which is designed for building and training machine learning models at scale.
Dataiku DSS makes it easy to use MLlib without coding, by offering it as an optional backend engine for creating models directly from within its interface.
To follow along, you need access to an instance of Dataiku with Spark enabled, and a working installation of Spark, version 1.4 or later. The more recent the version the better, as the MLlib API is evolving quickly.
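If you are unsure whether your Spark installation meets the 1.4+ requirement, a small check like the one below can help. It is only a sketch: the helper names `parse_version` and `spark_is_supported` are made up for this example and are not part of Spark or Dataiku.

```python
import re

# Minimum Spark version stated in this tutorial.
MIN_SPARK_VERSION = (1, 4)


def parse_version(version):
    """Return (major, minor) from a version string such as '2.4.8' or '3.5.0-SNAPSHOT'."""
    match = re.match(r"(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"Unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def spark_is_supported(version, minimum=MIN_SPARK_VERSION):
    """True if the given Spark version is at least the tutorial's minimum."""
    # Tuples compare element by element, so (1, 10) correctly ranks above (1, 4).
    return parse_version(version) >= minimum
```

With PySpark installed, you could call `spark_is_supported(pyspark.__version__)`; otherwise, pass in the version string reported by your cluster.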
This tutorial uses the well-known Titanic dataset, available for instance from the corresponding Kaggle competition. Start by downloading the files, then create the two datasets, train and test.
Training an MLlib model
Double-click on your train dataset, and create a new Analysis using the green button at the top right. From the Survived column header, click on Create Prediction model…:
This is the important part. In the new modal window, in the ML Backend drop-down menu, select a Spark configuration (we’ll use the default here):
Create the model. You are taken to a screen telling you when the model is ready to be trained. Do not train the model yet; instead, click on Settings:
Under the Algorithms section, activate Random Forests:
Click on Train, and wait for your task to complete. Once done, the summary results screen appears:
Your models are now trained and ready to be deployed, so that scoring new records can be automated.
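For comparison, here is roughly what training a Random Forest on the Titanic data looks like when written by hand against Spark's Python API. This is a sketch only, not the code Dataiku generates. It assumes Spark 2.2 or later (for `Imputer` and the DataFrame-based `pyspark.ml` API), the column names from Kaggle's train.csv, and a hand-picked subset of features. The function names, `num_trees`, and the `_imp`/`_idx` suffixes are illustrative choices.

```python
# Columns from the Kaggle Titanic train.csv used in this sketch (an assumed subset).
LABEL_COL = "Survived"
IMPUTED_COLS = ["Age", "Fare"]           # numeric columns that may contain missing values
OTHER_NUMERIC_COLS = ["Pclass", "SibSp", "Parch"]
CATEGORICAL_COLS = ["Sex"]


def build_pipeline(num_trees=100):
    """Assemble preprocessing and a Random Forest into one Spark ML pipeline."""
    # Imported here so the module can be loaded on a machine without Spark.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.feature import Imputer, StringIndexer, VectorAssembler

    imputed = [c + "_imp" for c in IMPUTED_COLS]
    indexed = [c + "_idx" for c in CATEGORICAL_COLS]

    # Fill missing numeric values (Imputer uses the column mean by default).
    imputer = Imputer(inputCols=IMPUTED_COLS, outputCols=imputed)
    # Turn string categories into numeric indices.
    indexers = [
        StringIndexer(inputCol=c, outputCol=c + "_idx", handleInvalid="keep")
        for c in CATEGORICAL_COLS
    ]
    # MLlib estimators expect all features packed into a single vector column.
    assembler = VectorAssembler(
        inputCols=imputed + OTHER_NUMERIC_COLS + indexed, outputCol="features"
    )
    forest = RandomForestClassifier(
        labelCol=LABEL_COL, featuresCol="features", numTrees=num_trees, seed=42
    )
    return Pipeline(stages=[imputer, *indexers, assembler, forest])


def train(spark, train_csv_path):
    """Read the training CSV with an existing SparkSession and fit the pipeline."""
    train_df = spark.read.csv(train_csv_path, header=True, inferSchema=True)
    return build_pipeline().fit(train_df)
```

The visual interface handles the equivalent of these preprocessing and training steps for you, which is the point of selecting the Spark backend there.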
Using an MLlib model
The Random Forest model offers the best performance. Click on it, and from the top right, select DEPLOY:
You are taken to the Flow screen. From the last green Prediction icon, select Apply and create a scoring recipe that will be used to score the test set:
Your flow is now complete:
You just need to build the dataset to get the predictions!
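In hand-written Spark code, the scoring recipe corresponds to calling `transform` on a fitted model. The sketch below assumes a fitted `pyspark.ml` model (or any object with a compatible `transform` method) and a test DataFrame with a `PassengerId` column, as in Kaggle's test.csv. The function name `score_test_set` is hypothetical.

```python
def score_test_set(model, test_df, id_col="PassengerId"):
    """Apply a fitted model to unseen rows and keep each id with its prediction."""
    # transform() appends the model's output columns to the input DataFrame;
    # "prediction" is the default output column name for Spark ML classifiers.
    scored = model.transform(test_df)
    return scored.select(id_col, "prediction")
```

Building the scored dataset from the Flow does the same job without any of this code.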
Using MLlib in Dataiku can now be done entirely from the interface, without having to write complex code. This opens up great opportunities as more and more people will be able to leverage Spark to analyze data.