Tutorial | Prepare your data (Core Designer part 4)¶
At the end of the tutorial on visually exploring data, we realized our categories of t-shirts needed to be consistently named. We can make this happen with a Prepare recipe.
The Prepare recipe allows you to define a series of steps, or actions, to take on the dataset.
In this tutorial, you will:
Create a Prepare recipe.
Add various processor steps in different ways.
Parse date columns.
Use the Formula language.
In addition to the written instructions and screenshots, you’ll also find several short screencasts recording the actions described after some sections.
Let’s get started preparing the dataset!
Open the orders dataset.
Click on the Actions button or the plus sign at the top-right of the screen to expand the right panel.
In the Actions tab, under the Visual Recipes section, click on Prepare. The New data preparation recipe window opens.
Provide an input dataset and accept the default output dataset name of
Click Create Recipe.
You can also set the value of Store into to decide where the output data will live. In this example, the output is written to the local filesystem. Alternatively, the output could be written to a relational database or a distributed filesystem if the infrastructure exists. In this case, use the default choice for your instance.
The types of steps you can add to a Prepare recipe are wide-ranging and powerful. One example is reordering columns.
Drag the order_id column in front of the pages_visited column. A step describing this action is automatically added to the recipe’s Script.
The Analyze window provides a quick summary of the column data based on a sample subset of data. It also allows you to perform various data cleansing actions.
In this section, we’ll use it to recode the values of the tshirt_category column to standardize the t-shirt categories.
Click on the column name tshirt_category, and select Analyze from the dropdown menu.
Select White T-Shirt M and Wh Tshirt M.
From the Mass Actions dropdown, choose Merge selected.
Choose to replace the values with White T-Shirt M, and click Merge.
Repeat this process for other categories until only six categories remain.
When all necessary replacements have been made, close the Analyze window, and see that a Replace step has been added to the Prepare script.
Replacing the four values in this step affects 517 rows in the sample. We could have created this step explicitly in the script, but the Analyze window provides a quick and intuitive shortcut to build the step.
Notice that we are now in Step preview mode, which means that any changes made by the script step is highlighted. Above, you can see that the values in blue were modified by the Replace step.
Once the changes reviewed, you can click on the Disable preview button in the top bar to see your data as it will appear after processing.
Now, let’s deal with the order_date.
Parse the date¶
At this point, the storage type of the order_date column is a string, but its meaning inferred by Dataiku is an unparsed date. Let’s parse it so that we can treat it as a proper date.
Open the order_date column dropdown, and select Parse date. It opens the Smart Date dialog, which shows the most likely formats for our dates and what the dates would look like once parsed, with some sample values from the dataset. In our case, the dates appear to be in yyyy/MM/dd format.
With the yyyy/MM/dd format selected, click Use Date Format.
See that a Parse date step has been added to the script.
By default, this step creates a new column named order_date_parsed. Note how both its storage type and meaning are a date.
Rename/Delete a column¶
In the Parse date script step, we could leave the name of the output column empty in order to parse the column in place. Instead of overriding the original date column, let’s delete it and rename the new column.
Click on the order_date column header dropdown, and choose Delete.
Click on the order_date_parsed column header dropdown, and choose Rename.
Enter the name
order_date, and click OK.
For the last step in the script, click on the eye icon to re-enable the step preview.
Finally, let’s use a Formula step to compute the dollar value of each t-shirt order. Dataiku formulas are a very powerful expression language to perform calculations, manipulate strings, and much more.
This time, we will not add the step by clicking on a column header, but instead search the processors library which includes all 100+ data preparation processors.
Click the yellow +Add a New Step button near the bottom left of the page.
Select Formula (you can search for it).
totalas the name of the new output column.
In the expression field, type
tshirt_price * tshirt_quantity(you can also select Open Editor Panel to use the advanced formula editor, which will autocomplete column names).
Click anywhere, and see the new total column appear.
Remove both the tshirt_price and tshirt_quantity columns by clicking on the column header and choosing Delete.
Recall that the data visible in the recipe is a preview of what your output dataset will look like based on a sample of data. With our data preparation finished, we must now run the recipe on the whole input dataset.
Once all the steps defined in your script, it is time to apply the script on the whole dataset and explore the output dataset.
Click Run in the lower-left corner of the page (or use the keyboard shortcut
Dataiku uses its own engine for this recipe, but depending on your infrastructure and the type of recipe, you can choose where the computation takes place.
When the job completes, click Explore dataset orders_prepared to view the output dataset. You can also return to the Flow and see your progress.
We can always change the name of our dataset later if we decide on a more descriptive name or if we want to apply a dataset naming convention. Dataiku lets you rename datasets once they are created. To rename a dataset, follow the How-to | Rename a dataset article.