Split data into training and testing sets#

One advantage of an end-to-end platform like Dataiku is that data preparation can be done in the same tool as machine learning. For example, before building a model, you may wish to create a holdout set. Let’s do this with a visual recipe.

  1. From the Flow, click the job_postings_prepared_joined dataset once to select it.

  2. Open the Actions tab.

  3. Select Split from the menu of visual recipes.

  4. Click + Add; name the output train; and click Create Dataset.

  5. Click + Add again; name the second output test; and click Create Dataset.

  6. Once you have defined both output datasets, click Create Recipe.

Dataiku screenshot of the dialog to create a Split recipe.

Define a Split method#

The Split recipe divides the input dataset into any number of output datasets. It can do so in different ways, such as by mapping values of a column, defining filters, or, as you’ll see here, randomly:

  1. On the Splitting step of the recipe, select Randomly dispatch data as the splitting method.

  2. Set the ratio to 80% for the train dataset, leaving the remaining 20% for the test dataset.

  3. Click Run at the bottom left (or type @ + r + u + n) to build these two output datasets.

  4. When the job finishes, navigate back to the Flow (g + f) to see your progress.

Dataiku screenshot of the settings for a Split recipe.
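
To build intuition for what a random 80/20 dispatch does, here is a minimal Python sketch using only the standard library. It is not Dataiku's implementation: the function name `random_dispatch`, the `seed` parameter, and the sample `job_id` rows are all hypothetical, and the sketch assumes each row is sent independently to one output with the given probability.

```python
import random


def random_dispatch(rows, train_ratio=0.8, seed=42):
    """Send each row to train with probability train_ratio, else to test.

    Hypothetical helper for illustration only; not part of any Dataiku API.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for row in rows:
        if rng.random() < train_ratio:
            train.append(row)
        else:
            test.append(row)
    return train, test


# Made-up stand-in for the rows of job_postings_prepared_joined.
rows = [{"job_id": i} for i in range(1000)]
train, test = random_dispatch(rows)

# Every row lands in exactly one output; the sizes are close to,
# but not necessarily exactly, 800 and 200.
print(len(train), len(test))
```

Because each row is dispatched independently in this sketch, the resulting proportions only approximate the requested ratio, and the approximation improves as the dataset grows.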

Create a separate Flow zone#

Before you start training models, there’s one organizational step that will be helpful as your projects grow in complexity. Let’s create a separate Flow zone for the machine learning stage of this project.

  1. Hold the Command/Ctrl key and click to select both the train and test datasets.

  2. Open the Actions tab.

  3. In the Flow Zones section, click Move.

  4. Name the new zone Machine Learning.

  5. Click Confirm.

Dataiku screenshot of the dialog for creating a Flow zone.

Now rename the default zone, and you’ll have a clearly labeled space for each of the two stages of the project.

  1. Click on the original Default zone.

  2. Open the Actions tab.

  3. Select Edit.

  4. Name the zone Data Preparation.

  5. Click Confirm.

Dataiku screenshot of the dialog for editing a zone.