Split data into training and testing sets
One advantage of an end-to-end platform like Dataiku is that data preparation can be done in the same tool as machine learning. For example, before building a model, you may wish to create a holdout set. Let’s do this with a visual recipe.
From the Flow, click the job_postings_prepared_joined dataset once to select it.
Open the Actions tab of the right panel.
Select Split from the menu of visual recipes.
Click + Add; name the output train; and click Create Dataset.
Click + Add again; name the second output test; and click Create Dataset.
Once you have defined both output datasets, click Create Recipe.
Figure: Dataiku screenshot of the dialog to create a Split recipe.
Define a Split method
The Split recipe divides the input dataset into any number of output datasets. Possible splitting methods include mapping values of a column, defining filters, or, as you'll see here, dispatching rows randomly:
On the Splitting step of the recipe, select Randomly dispatch data as the splitting method.
Set the ratio to 80% for the train dataset, leaving the remaining 20% for the test dataset.
Click the Run button at the bottom left (or type @ + r + u + n) to build these two output datasets.
When the job finishes, navigate back to the Flow (g + f) to see your progress.
Figure: Dataiku screenshot of the settings for a Split recipe.
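To make the idea of a random 80/20 dispatch concrete, here is a minimal sketch in plain Python. It is not how Dataiku implements the Split recipe; it only illustrates the general technique of sending each row to one of two outputs with a fixed probability. The function name random_dispatch, the job_id field, and the seed value are hypothetical and chosen for illustration.

```python
import random


def random_dispatch(rows, train_ratio=0.8, seed=42):
    """Send each row to train with probability train_ratio, otherwise to test."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for row in rows:
        if rng.random() < train_ratio:
            train.append(row)
        else:
            test.append(row)
    return train, test


# Hypothetical stand-in rows for the job_postings_prepared_joined dataset
rows = [{"job_id": i} for i in range(1000)]
train, test = random_dispatch(rows)

# Every row lands in exactly one output; the sizes are close to 80/20
# but not exact, because each row is dispatched independently.
print(len(train), len(test))
```

Because each row is assigned independently, the resulting proportions are approximate rather than exact. Fixing the seed makes the same rows land in the same output on every run, which keeps the holdout set stable between experiments.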
Create a separate Flow zone
Before you start training models, there’s one organizational step that will be helpful as your projects grow in complexity. Let’s create a separate Flow zone for the machine learning stage of this project.
Use the Command/Ctrl key and the cursor to select both the train and test datasets.
Open the Actions tab of the right panel.
In the Flow Zones section, click Move.
Name the new zone Machine Learning.
Click Confirm.
Figure: Dataiku screenshot of the dialog for creating a Flow zone.
Now just rename the default zone, and you’ll have two clear spaces for these two stages of the project.
Click on the original Default zone.
Open the Actions tab.
Select Edit.
Give it the name Data Preparation.
Click Confirm.
Figure: Dataiku screenshot of the dialog for editing a zone.