Tutorial | Prepare your data (Core Designer part 4)

At the end of the tutorial on visually exploring data, we realized our categories of t-shirts needed to be consistently named. We can make this happen with a Prepare recipe.

The Prepare recipe allows you to define a series of steps, or actions, to take on the dataset.


In this tutorial, you will:

  • Create a Prepare recipe.

  • Add various processor steps in different ways.

  • Parse date columns.

  • Use the Formula language.

Starting here?

If you skipped the previous sections, you need to:

  • Have access to a Dataiku instance (version 9.0 or above).

  • Create the project (+New Project > DSS tutorials > Core Designer / Basics > Basics 102)) or download and import it from this website.

Create a Prepare recipe


In addition to the written instructions and screenshots, you’ll also find several short screencasts recording the actions described after some sections.

Let’s get started preparing the dataset!

  1. Open the orders dataset.

  2. Click on the Actions button or the plus sign at the top-right of the screen to expand the right panel.

  3. In the Actions tab, under the Visual Recipes section, click on Prepare. The New data preparation recipe window opens.

  4. Provide an input dataset and accept the default output dataset name of orders_prepared.

  5. Click Create Recipe.

Dataiku screenshot of the dialog for a new dataset preparation recipe.


You can also set the value of Store into to decide where the output data will live. In this example, the output is written to the local filesystem. Alternatively, the output could be written to a relational database or a distributed filesystem if the infrastructure exists. In this case, use the default choice for your instance.

Add a step from dragging columns

The types of steps you can add to a Prepare recipe are wide-ranging and powerful. One example is reordering columns.

  1. Drag the order_id column in front of the pages_visited column. A step describing this action is automatically added to the recipe’s Script.

Step in the script of a Prepare recipe.

Add a step through the Analyze window

The Analyze window provides a quick summary of the column data based on a sample subset of data. It also allows you to perform various data cleansing actions.

In this section, we’ll use it to recode the values of the tshirt_category column to standardize the t-shirt categories.

  1. Click on the column name tshirt_category, and select Analyze from the dropdown menu.

    Dataiku screenshot of the column header dropdown selecting Analyze.

  2. Select White T-Shirt M and Wh Tshirt M.

  3. From the Mass Actions dropdown, choose Merge selected.

  4. Choose to replace the values with White T-Shirt M, and click Merge.

  5. Repeat this process for other categories until only six categories remain.

    Analyze window showing a mass action option to merge two categorical columns and replace the column header.

  6. When all necessary replacements have been made, close the Analyze window, and see that a Replace step has been added to the Prepare script.

Step preview of a script in a Prepare recipe.

Replacing the four values in this step affects 517 rows in the sample. We could have created this step explicitly in the script, but the Analyze window provides a quick and intuitive shortcut to build the step.


Notice that we are now in Step preview mode, which means that any changes made by the script step is highlighted. Above, you can see that the values in blue were modified by the Replace step.

Once the changes reviewed, you can click on the Disable preview button in the top bar to see your data as it will appear after processing.

See a video covering these steps

Add a suggested step from a column header

Now, let’s deal with the order_date.

Parse the date

At this point, the storage type of the order_date column is a string, but its meaning inferred by Dataiku is an unparsed date. Let’s parse it so that we can treat it as a proper date.

  1. Open the order_date column dropdown, and select Parse date. It opens the Smart Date dialog, which shows the most likely formats for our dates and what the dates would look like once parsed, with some sample values from the dataset. In our case, the dates appear to be in yyyy/MM/dd format.

  2. With the yyyy/MM/dd format selected, click Use Date Format.

  3. See that a Parse date step has been added to the script.

By default, this step creates a new column named order_date_parsed. Note how both its storage type and meaning are a date.

Rename/Delete a column

In the Parse date script step, we could leave the name of the output column empty in order to parse the column in place. Instead of overriding the original date column, let’s delete it and rename the new column.

  1. Click on the order_date column header dropdown, and choose Delete.

  2. Click on the order_date_parsed column header dropdown, and choose Rename.

  3. Enter the name order_date, and click OK.

  4. For the last step in the script, click on the eye icon to re-enable the step preview.

Dataiku screenshot of a Prepare recipe step parsing dates.

See a video covering these steps

Add a Formula step from the processors library

Finally, let’s use a Formula step to compute the dollar value of each t-shirt order. Dataiku formulas are a very powerful expression language to perform calculations, manipulate strings, and much more.

This time, we will not add the step by clicking on a column header, but instead search the processors library which includes all 100+ data preparation processors.

  1. Click the yellow +Add a New Step button near the bottom left of the page.

  2. Select Formula (you can search for it).

  3. Enter total as the name of the new output column.

  4. In the expression field, type tshirt_price * tshirt_quantity (you can also select Open Editor Panel to use the advanced formula editor, which will autocomplete column names).

  5. Click anywhere, and see the new total column appear.

  6. Remove both the tshirt_price and tshirt_quantity columns by clicking on the column header and choosing Delete.

Dataiku screenshot of a Prepare recipe step parsing dates.

Recall that the data visible in the recipe is a preview of what your output dataset will look like based on a sample of data. With our data preparation finished, we must now run the recipe on the whole input dataset.

Run the recipe

Once all the steps defined in your script, it is time to apply the script on the whole dataset and explore the output dataset.

  1. Click Run in the lower-left corner of the page (or use the keyboard shortcut @+r+u+n).


    Dataiku uses its own engine for this recipe, but depending on your infrastructure and the type of recipe, you can choose where the computation takes place.

  2. When the job completes, click Explore dataset orders_prepared to view the output dataset. You can also return to the Flow and see your progress.


We can always change the name of our dataset later if we decide on a more descriptive name or if we want to apply a dataset naming convention. Dataiku lets you rename datasets once they are created. To rename a dataset, follow the How-to | Rename a dataset article.

What’s next?

Congratulations on completing your first Prepare recipe! However, there’s much more data exploration and preparation to be done.

Next, let’s dive deeper by computing statistics on a dataset.