Quick Start | Dataiku for machine learning#

Get started#

Recent advancements in generative AI have made it easy to apply for jobs. But be careful! Scammers have also been known to create fake job postings in the hopes of stealing personal information. Let’s see if you, with Dataiku’s help, can tell a real job posting from a fake one!

Objectives#

In this quick start, you’ll:

  • Use a visual recipe to divide data into training and testing sets.

  • Train prediction models for a binary classification task.

  • Iterate on the design of a model training session.

  • Apply a chosen model to new data.

Note

This quick start introduces Dataiku’s visual tools for machine learning. If your primary interest is using custom code and Dataiku for machine learning projects, please see the Developer Guide getting started section.

Tip

To check your work, you can review a completed version of this entire project from data preparation through MLOps on the Dataiku gallery.

Create an account#

To follow along with the steps in this tutorial, you need access to a Dataiku instance running version 11.0 or later. If you do not already have access, you can get started in one of two ways:

  • Start a 14-day free trial. See this how-to for help if needed.

  • Install the free edition locally for your operating system.

Open Dataiku#

The first step is getting to the homepage of your Dataiku Design node.

  1. Go to the Launchpad.

  2. Click Open Instance in the Design node tile of the Overview panel once your instance has powered up.

  3. See this how-to if you encounter any difficulties.

Important

If using a self-managed version of Dataiku, including the locally-downloaded free edition on Mac or Windows, open the Dataiku Design node directly in your browser.

Create the project#

Once you are on the Design node homepage, you can create the tutorial project.

  1. From the Dataiku Design homepage, click + New Project.

  2. Click DSS tutorials in the dropdown menu.

  3. In the dialog, click Quick Starts on the left-hand panel.

  4. Choose Machine Learning Quick Start, and then click OK.

Dataiku screenshot of the dialog for creating a new project.

Note

You can also download the starter project from this website and import it as a zip file.

Review the Flow#

See a screencast covering this section’s steps

One of the first concepts a user needs to understand about Dataiku is the Flow. The Flow is the visual representation of how datasets, recipes (steps for data transformation), and models work together to move data through an analytics pipeline.

See the Flow’s visual grammar#

Dataiku has its own visual grammar to organize AI, machine learning, and analytics projects in a collaborative way.

| Shape | Item | Icon |
|---|---|---|
| Square | Dataset | The icon on the square represents the dataset’s storage location, such as Amazon S3, Snowflake, PostgreSQL, etc. |
| Circle | Recipe | The icon on the circle represents the type of data transformation, such as a broom for a Prepare recipe or coiled snakes for a Python recipe. |
| Diamond | Model | The icon on the diamond represents the type of modeling task, such as prediction, clustering, time series forecasting, etc. |

Tip

In addition to shape, color has meaning too.

  • Datasets are blue, but those shared from other projects are black.

  • Visual recipes are yellow. Code recipes are orange. LLM recipes are pink. Plugin recipes are red.

  • Machine learning elements are green.

Take a look now!

  1. If not already there, from the left-most menu in the top navigation bar, click on the Flow (or use the keyboard shortcut g + f).

  2. Double-click on the job_postings dataset to open it.

Dataiku screenshot of a basic Flow.

Tip

There are many other keyboard shortcuts beyond g + f. Type ? to pull up a menu or see the complete list in the reference documentation.

Analyze the data#

This project begins from a labeled dataset named job_postings composed of 95% real and 5% fake job postings. For the column fraudulent, values of 0 and 1 represent real and fake job postings, respectively. Your task will be to build a prediction model capable of classifying a job posting as real or fake.

Let’s take a quick look at the data.

  1. Click on the header of the first column job_id to open a menu of options.

  2. Select Analyze.

  3. Use the arrows at the top left of the dialog to cycle through the summary of each column, including the target variable, fraudulent.

Dataiku screenshot of the Analyze tool.
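
The class balance described in this section can also be checked with a few lines of code. The following is a minimal sketch in plain Python, assuming hypothetical in-memory rows. Only the column name fraudulent and its 0/1 coding come from this tutorial; the function name class_balance and the sample values are made up for illustration.

```python
from collections import Counter

def class_balance(rows, target="fraudulent"):
    """Return the share of each target value, e.g. {0: 0.95, 1: 0.05}."""
    counts = Counter(row[target] for row in rows)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Hypothetical sample: 19 real postings (0) and 1 fake posting (1).
sample_rows = [{"job_id": i, "fraudulent": 0} for i in range(19)]
sample_rows.append({"job_id": 19, "fraudulent": 1})

balance = class_balance(sample_rows)
print(balance)
```

A strongly skewed result like this one is a signal to look beyond plain accuracy when judging a model later on.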

Build the Flow#

Unlike the initially uploaded datasets, the downstream datasets appear as outlines. This is because they have not been built yet, meaning that the relevant recipes have not been run to populate them. This is not a problem, because the Flow contains the recipes required to create these outputs at any time.

  1. Navigate back to the Flow (g + f).

  2. Click to open the Flow Actions menu in the bottom right.

  3. Click Build all.

  4. Leaving the default options, click Build to run the recipes necessary to create the items furthest downstream.

  5. When the job completes, refresh the page to see the built Flow.

Dataiku screenshot of the dialog for building the Flow.
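
One way to picture why unbuilt datasets are not a problem: the Flow is a dependency graph, so a downstream dataset can be rebuilt by running its upstream recipes in order. The sketch below illustrates that idea with Python's standard graphlib module (Python 3.9+). It is not a description of how Dataiku schedules jobs internally, and the intermediate dataset name job_postings_prepared is an assumption made for the example.

```python
from graphlib import TopologicalSorter

# Map each dataset to the datasets it is built from.
# "job_postings_prepared" is a hypothetical intermediate name.
flow = {
    "job_postings_prepared": {"job_postings"},
    "job_postings_prepared_joined": {"job_postings_prepared"},
}

# static_order() lists every node after all of its dependencies.
build_order = list(TopologicalSorter(flow).static_order())
print(build_order)
```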

Note

To learn more about creating this Flow, see the Quick Start | Dataiku for data preparation.

Split data into training and testing sets#

See a screencast covering this section’s steps

One advantage of an end-to-end platform like Dataiku is that data preparation can be done in the same tool as machine learning. For example, before building a model, you may wish to create a holdout set. Let’s do this with a visual recipe.

  1. From the Flow, click the job_postings_prepared_joined dataset once to select it.

  2. Open the Actions tab.

  3. Select the Split recipe from the menu of visual recipes.

  4. Click + Add; name the output train; and click Create Dataset.

  5. Click + Add again; name the second output test; and click Create Dataset.

  6. Once you have defined both output datasets, click Create Recipe.

Dataiku screenshot of the dialog to create a Split recipe.

Define a Split method#

The Split recipe allows you to divide the input dataset into some number of output datasets in different ways, such as by mapping values of a column, defining filters, or as you’ll see here, randomly:

  1. On the Splitting step of the recipe, choose Randomly dispatch data as the splitting method.

  2. Set the ratio to 80% for the train dataset, leaving the remaining 20% for the test dataset.

  3. Click the green Run at the bottom left (or type @ + r + u + n) to build these two output datasets.

  4. When the job finishes, navigate back to the Flow (g + f) to see your progress.

Dataiku screenshot of the settings for a Split recipe.
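
For readers who think in code, the idea of a random 80/20 split can be sketched as follows. This is a minimal illustration in plain Python, not the Split recipe's implementation: the function name random_split, the seed, and the shuffle-then-cut approach are all assumptions chosen to keep the example short and reproducible.

```python
import random

def random_split(rows, train_ratio=0.8, seed=42):
    """Shuffle a copy of the rows, then cut it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical stand-in for 100 rows of a dataset.
rows = list(range(100))
train, test = random_split(rows)
print(len(train), len(test))
```

Every row lands in exactly one of the two outputs, which is the property that makes the test set a fair holdout.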

Create a separate Flow zone#

Before you start training models, there’s one organizational step that will be helpful as your projects grow in complexity. Let’s create a separate Flow zone for the machine learning stage of this project.

  1. Use the Command/Ctrl key and the cursor to select both the train and test datasets.

  2. Open the Actions tab.

  3. In the Flow Zones section, click Move.

  4. Name the new zone Machine Learning.

  5. Click Confirm.

Dataiku screenshot of the dialog for creating a Flow zone.

Now just rename the default zone, and you’ll have two clear spaces for these two stages of the project.

  1. Click on the original Default zone.

  2. Open the Actions tab.

  3. Select Edit.

  4. Give the name Data Preparation.

  5. Click Confirm.

Dataiku screenshot of the dialog for editing a zone.

Train machine learning models#

See a screencast covering this section’s steps

Now that we have set aside the train and test data in a separate Flow zone, let’s start creating prediction models on the training data!

Important

Here we’ll use visual AutoML within Dataiku to train a model, but this is not the only option. In addition to using code for custom preprocessing or custom algorithms within the visual ML interface, users can also import MLflow models into Dataiku.

Create an AutoML prediction task#

The first step is to define the basic parameters of the machine learning task at hand.

  1. Double-click anywhere on the Machine Learning Flow zone to open it.

  2. Select the train dataset.

  3. In the Actions tab, click on the Lab button. Alternatively, navigate to the Lab tab of the right side panel (shown below).

  4. Among the menu of visual ML tasks, choose AutoML Prediction.

Dataiku screenshot of the interface for selecting an autoML prediction task.

Now you just need to choose the target variable and which kind of models you want to build.

  1. Choose fraudulent as the target variable on which to create the prediction model.

  2. Click Create, keeping the default setting of Quick Prototypes.

Dataiku screenshot of the dialog for creating an AutoML prediction task.

Train models with the default design#

Based on the characteristics of the input training data, Dataiku has automatically prepared the design of the model. But no models have been trained yet!

  1. Before adjusting the design, click Train to start a model training session.

  2. Click Train again to confirm.

Dataiku screenshot of the dialog to train an ML model.

Inspect a model’s results#

See a screencast covering this section’s steps

Once your models have finished training, let’s see how Dataiku did.

  1. While in the Result tab, click on the Random forest model in Session 1 on the left-hand side of the screen to open a detailed model report.

Dataiku screenshot of the Result tab for a prediction task.

Check model explainability#

One important aspect of a model is the ability to understand its predictions. The Explainability section of the report includes many tools for doing so.

  1. In the Explainability section, click to open the Feature importance panel to see an estimate of the influence of a feature on the predictions.

Dataiku screenshot of the feature importance chart for a model in the Lab.

Note

Due to the somewhat random nature of algorithms like random forest, you might not have exactly the same results throughout this modeling exercise. This is to be expected.

Check model performance#

You’ll also want to dive deeper into a model’s performance, starting with basic metrics for a classification problem like accuracy, precision, and recall.

  1. In the Performance section, click to open the Confusion matrix panel to check how well the model did at classifying real and fake job postings.

Dataiku screenshot of the confusion matrix for a model in the Lab.
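
To make accuracy, precision, and recall concrete, here is a minimal sketch that derives them from the four cells of a confusion matrix. The labels are hypothetical (1 = fake posting, 0 = real posting), and the helper names confusion_counts and metrics are invented for this example rather than taken from Dataiku.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary target."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts

def metrics(counts):
    """Compute accuracy, precision, and recall from confusion counts."""
    tp, fp, tn, fn = (counts[k] for k in ("tp", "fp", "tn", "fn"))
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical labels for ten postings.
actual = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
predicted = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

counts = confusion_counts(actual, predicted)
scores = metrics(counts)
print(counts, scores)
```

On a dataset where about 95% of rows are real, a model that always predicts 0 would reach roughly 95% accuracy while having a recall of 0 for fake postings, which is why precision and recall deserve attention here.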

Check model information#

Alongside the results, you’ll also want to know exactly how the model was trained.

  1. In the Model Information section, click to open the Features panel to check which features were included in the model, which were rejected (such as the text features), and how they were handled.

  2. When finished, click on Models to return to the Result home.

Dataiku screenshot of the feature handling for a model in the Lab.

Iterate on the design of a model training session#

See a screencast covering this section’s steps

Thus far, Dataiku has produced quick prototypes. From these baseline models, you can work on iteratively adjusting the design, training new sessions of models, and evaluating the results.

  1. Switch to the Design tab.

Dataiku screenshot of the Design tab.

Tour the Design tab#

From the Design tab, you have full control over the design of a model training session. Take a quick tour of the available options. Some examples include:

  1. In the Train / Test Set panel, you could apply a k-fold cross validation strategy.

  2. In the Feature reduction panel, you could apply a reduction method like Principal Component Analysis.

  3. In the Algorithms panel, you could select different machine learning algorithms or import custom Python models.

Dataiku screenshot of the model design tab showing the algorithms panel.
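
As an illustration of the k-fold cross validation strategy mentioned above, the following sketch generates the row indices for each fold in plain Python. It is a simplified, unshuffled version written for this example (the function name k_fold_indices is hypothetical); in practice rows are usually shuffled or stratified first, and this is not a description of Dataiku's implementation.

```python
def k_fold_indices(n_rows, k=5):
    """Yield (train_idx, test_idx) pairs; each row is tested exactly once."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, k=3))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)
```

Each model is trained k times, once per fold, so the reported metric is an average over k held-out portions rather than a single split.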

Reduce the number of features#

Instead of adding complexity, let’s simplify the model by including only the most important features. Having fewer features could hurt the model’s predictive performance, but it may bring other benefits, such as greater interpretability, faster training times, and reduced maintenance costs.

  1. In the Design tab, navigate to the Features handling panel.

  2. Click the box at the top left of the feature list to select all.

  3. For the role, click Reject to de-select all features.

  4. Turn on the three most influential features according to the Feature importance chart seen earlier: country, has_company_logo, and len_company_profile.

Dataiku screenshot of the feature handling panel in the Design tab.

Tip

Your top three features may be slightly different. Feel free to choose these three or the three most important from your own results.
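
Picking the most influential features amounts to ranking them by importance and keeping the first few. A minimal sketch, assuming made-up importance values: the three leading feature names come from this tutorial, while the numbers, the two trailing feature names, and the helper top_features are hypothetical.

```python
def top_features(importances, n=3):
    """Return the names of the n features with the highest importance."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:n]]

# Hypothetical importance values; your own chart will show different numbers.
importances = {
    "has_company_logo": 0.24,
    "telecommuting": 0.05,        # hypothetical feature name
    "country": 0.31,
    "has_questions": 0.04,        # hypothetical feature name
    "len_company_profile": 0.18,
}

selected = top_features(importances)
print(selected)
```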

Train a second session#

Once you have just the top three features in the model design, you can kick off another training session.

  1. Click the blue Train button.

  2. Click Train once more to confirm.

Dataiku screenshot of the interface to train a model.

Apply a model to generate predictions on new data#

See a screencast covering this section’s steps

Up until now, the models you’ve trained are present only in the Lab, a space for experimental prototyping and analysis. You can’t actually use any of these models until you have added them to the Flow, where your actual project pipeline of datasets and recipes lives. Let’s do that now!

Choose a model to deploy#

Many factors could impact the choice of which model to deploy. For many use cases, the model’s performance is not the only deciding factor.

Compared to the larger model, the simpler model with three features cost about four hundredths of a point in performance. For some use cases, this may be a huge amount, but in others it may be a bargain for a model that is more interpretable, cheaper to train, and easier to maintain. Since performance is not the priority in this tutorial, let’s choose the simpler option.

  1. From the Result tab, click Random forest (s2) to open the model report of the simpler random forest model from Session 2.

Dataiku screenshot of the Result tab of a visual analysis.

Now you just need to deploy this model from the Lab to the Flow.

  1. Click Deploy.

  2. Click Create to confirm.

Dataiku screenshot of the dialog for deploying a model to the Flow.

Explore a saved model object#

You now have two green objects in the Flow that you can use to generate predictions on new data: a training recipe and a saved model object.

  1. In the Machine Learning Flow zone, double-click on the diamond-shaped saved model to open it.

  2. Note the Active version label on the tile for the only model version available.

Dataiku screenshot of a saved model.

Note

As you retrain the model, perhaps due to new data or additional feature engineering, you’ll deploy new active versions of the model to the Flow. However, you’ll still have the ability to revert to previous model versions at any time.

Score data#

Now let’s use the model in the Flow to generate predictions on a new dataset of job postings that the model has not seen before.

  1. From the saved model screen, click to open the Actions tab.

  2. Select the Score recipe.

  3. For the input dataset, choose test.

  4. Click Create Recipe, accepting the default output name.

  5. Once on the Settings tab of the Score recipe, click Run (or type @ + r + u + n) to execute the recipe with the default settings.

Dataiku screenshot of the dialog for creating a Score recipe.

Tip

Here we applied the Score recipe to a model trained with Dataiku’s visual AutoML. However, you also have the option to surface models deployed on external cloud ML platforms within Dataiku and use them for scoring, monitoring, and more.

Inspect the scored data#

Compare the schemas of the test and test_scored datasets.

  1. When the job finishes, click Explore dataset test_scored.

  2. Note the addition of three new columns: proba_0, proba_1, and prediction.

  3. Navigate back to the Flow to see the scored dataset in the pipeline.

Dataiku screenshot of the output to the Score recipe.
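
To make the relationship between the three new columns concrete, here is a minimal sketch in plain Python. It assumes a hypothetical helper score_row and a cut-off of 0.5; the threshold actually applied by a deployed model may differ, so treat that value as an assumption rather than the recipe's behavior.

```python
def score_row(proba_1, threshold=0.5):
    """Build the three scoring columns from the probability of class 1."""
    return {
        "proba_0": 1.0 - proba_1,          # probability the posting is real
        "proba_1": proba_1,                # probability the posting is fake
        "prediction": int(proba_1 >= threshold),
    }

# Hypothetical probabilities for two postings.
likely_fake = score_row(0.8)
likely_real = score_row(0.2)
print(likely_fake, likely_real)
```

The two probabilities always sum to 1, and the prediction column is simply the class chosen once a threshold is applied to proba_1.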

Tip

How well was the model able to identify the fake job postings in the test dataset? That is a task for the Evaluate recipe, which you will encounter in other learning resources, such as the ML Practitioner learning path.

What’s next?#

Congratulations! You’ve taken your first steps toward training machine learning models with Dataiku and using them to score data.

You’re now ready to begin the ML Practitioner learning path and challenge yourself to earn the ML Practitioner certification.

Another option is to dive into the world of MLOps. In the Quick Start | Dataiku for MLOps, you can deploy an API endpoint from the model you’ve just created, and use it to answer real-time queries.

Note

You can also find more resources on machine learning in the following spaces: