Tutorial | Prepare recipe#

The Prepare recipe allows you to define a series of steps, or actions, to take on the dataset.

Get started#

Objectives#

In this tutorial, you will:

  • Create a Prepare recipe.

  • Add various processor steps in different ways.

  • Parse date columns.

  • Use the Formula language.

Prerequisites#

To complete this tutorial, you’ll need the following:

  • A Dataiku instance (version 9.0 and above).

Create the project#

To create the project:

  1. From the Dataiku Design homepage, click + New Project > DSS tutorials > Core Designer > Prepare Recipe.

  2. From the project homepage, click Go to Flow (or G + F).

Note

You can also download the starter project from this website and import it as a zip file.

Use case summary#

Let’s say we’re a company selling t-shirts and our dataset details each t-shirt order from 2013 to 2017. The data needs to be cleaned up and prepared. For example, we realized our categories of t-shirts are not consistently named and the dates are unparsed. We can make all the changes with a Prepare recipe.

Create a Prepare recipe#

Tip

In addition to the written instructions and screenshots, you’ll also find several short screencasts recording the actions described after some sections.

Let’s get started preparing the dataset!

  1. Open the orders dataset.

  2. Click on the Actions button or the plus sign at the top-right of the screen to expand the right panel.

  3. Under the Visual Recipes section, click on Prepare.

    The New data preparation recipe window opens.

  4. Keep orders as the input dataset and accept the default output dataset name of orders_prepared.

    Note

    You can always change the name of our dataset later if you decide on a more descriptive name or if you want to apply a dataset naming convention. Dataiku lets you rename datasets once they are created. To rename a dataset, follow the How-to | Rename a dataset article.

  5. Click Create Recipe.

Dataiku screenshot of the dialog for a new dataset preparation recipe.

Add a step from dragging columns#

The types of steps you can add to a Prepare recipe are wide-ranging and powerful. One example is reordering columns.

  1. Drag the order_id column in front of the pages_visited column. A step describing this action is automatically added to the recipe’s Script.

Step in the script of a Prepare recipe.

Add a step through the Analyze window#

The Analyze window provides a quick summary of the column data based on a sample subset of data. It also allows you to perform various data cleansing actions.

In this section, we’ll use it to recode the values of the tshirt_category column to standardize the t-shirt categories.

  1. Click on the column name tshirt_category, and select Analyze from the dropdown menu.

    Dataiku screenshot of the column header dropdown selecting Analyze.
  2. Select White T-Shirt M and Wh Tshirt M.

  3. From the Mass Actions dropdown, choose Merge selected.

  4. Choose to replace the values with White T-Shirt M, and click Merge.

  5. Repeat this process for other categories until only six categories remain.

    Analyze window showing a mass action option to merge two categorical columns and replace the column header.
  6. When all necessary replacements have been made, close the Analyze window, and see that a Replace step has been added to the Prepare script.

    Step preview of a script in a Prepare recipe.

Replacing the four values in this step affects 517 rows in the sample. We could have created this step explicitly in the script, but the Analyze window provides a quick and intuitive shortcut to build the step.

Note

Notice that we are now in Step preview mode, which means that any changes made by the script step is highlighted. Above, you can see that the values in blue were modified by the Replace step.

See a screencast covering these steps

Add a suggested step from a column header#

Now, let’s deal with the order_date column.

Parse the date#

At this point, the storage type of the order_date column is a string, but its meaning inferred by Dataiku is an unparsed date. Let’s parse it so that we can treat it as a proper date.

  1. Open the order_date column dropdown, and select Parse date.

    It opens the Smart Date dialog, which shows the most likely formats for our dates and what the dates would look like once parsed, with some sample values from the dataset. In our case, the dates appear to be in yyyy/MM/dd format.

  2. With the yyyy/MM/dd format selected, click Use Date Format.

  3. See that a Parse date step has been added to the script.

By default, this step creates a new column named order_date_parsed. Note how both its storage type and meaning are a date.

Rename/Delete a column#

In the Parse date script step, we could leave the name of the output column empty in order to parse the column in place. Instead of overriding the original date column, let’s delete it and rename the new column.

  1. Click on the order_date column header dropdown, and choose Delete.

  2. Click on the order_date_parsed column header dropdown, and choose Rename.

  3. Enter the name order_date, and click OK.

  4. For the last step in the script, click on the eye icon to re-enable the step preview.

Dataiku screenshot of a Prepare recipe step parsing dates.

See a screencast covering these steps

Add a Formula step from the processors library#

Finally, let’s use a Formula step to compute the dollar value of each t-shirt order. Dataiku formulas are a very powerful expression language to perform calculations, manipulate strings, and much more.

This time, we will not add the step by clicking on a column header, but instead search the processors library which includes about 100 data preparation processors.

  1. Click the yellow +Add a New Step button near the bottom left of the page.

  2. Select Formula (you can search for it).

  3. Enter total as the name of the new output column.

  4. In the expression field, type tshirt_price * tshirt_quantity (you can also select Open Editor Panel to use the advanced formula editor, which will autocomplete column names).

  5. Click anywhere, and see the new total column appear.

  6. Remove both the tshirt_price and tshirt_quantity columns by clicking on the column header and choosing Delete.

Dataiku screenshot of a Prepare recipe step parsing dates.

Recall that the data visible in the recipe is a preview of what your output dataset will look like based on a sample of data. With our data preparation finished, we must now run the recipe on the whole input dataset.

Run the recipe#

Once all the steps are defined in your script, it is time to apply the script on the whole dataset and explore the output dataset.

  1. Click Run in the lower-left corner of the page (or use the keyboard shortcut @+r+u+n).

  2. When the job completes, click Explore dataset orders_prepared to view the output dataset. You can also return to the Flow and see your progress.

What’s next?#

Congratulations on completing your first Prepare recipe! There’s still more data exploration and preparation to be done.

Next, let’s dive deeper by grouping some data on a dataset.