Tutorial | Prepare recipe#

The Prepare recipe allows you to define a series of steps, or actions, to take on the dataset.

Get started#

Objectives#

In this tutorial, you will:

  • Create a Prepare recipe.

  • Add various processor steps in different ways.

  • Process dates, numbers and text.

  • Use the Formula language.

Prerequisites#

To complete this tutorial, you’ll need the following:

  • A Dataiku instance (version 12.0 and above).

Create the project#

To create the project:

  1. From the Dataiku Design homepage, click + New Project > DSS tutorials > Core Designer > Prepare Recipe.

  2. From the project homepage, click Go to Flow (or g + f).

Note

You can also download the starter project from this website and import it as a zip file.

You’ll next want to build the Flow.

  1. Click Flow Actions at the bottom right of the Flow.

  2. Click Build all.

  3. Keep the default settings and click Build.

Use case summary#

Let’s say we’re a financial company that uses some credit card data to detect fraudulent transactions.

The project comes with three datasets, described in the table below.

Dataset

Description

tx

Each row is a unique credit card transaction with information such as the card that was used and the merchant where the transaction was made.

It also indicates whether the transaction has either been:

  • Authorized (a score of 1 in the authorized_flag column)

  • Flagged for potential fraud (a score of 0)

merchants

Each row is a unique merchant with information such as the merchant’s location and category.

cards

Each row is a unique credit card ID with information such as the card’s activation month or the cardholder’s FICO score (a common measure of creditworthiness in the US).

Create a Prepare recipe#

Let’s get started preparing the dataset!

  1. From the Flow, select the tx_joined dataset and click the Actions icon (+) from the right panel to open the Actions tab.

  2. Under the Visual Recipes section, click on Prepare.

    The New data preparation recipe window opens.

  3. Keep tx_joined as the input dataset.

  4. Name the output dataset tx_prepared.

  5. Click Create Recipe.

Dataiku screenshot of the dialog for a new dataset preparation recipe.

Add steps to the recipe#

Now that we have our Prepare recipe, it’s time to add some steps to it.

Process dates#

Let’s start with dates.

Only display date columns#

This step is optional but it shows you how to display only the columns you select.

Here, we want to focus our attention on the date columns, which are purchase_date and first_active_month. To do so:

  1. At the top right of the dataset preview, click on Display > Select displayed columns.

  2. In the Displayed columns dialog:

    • Set the Display option to Selected columns.

    • Unselect all columns using the checkbox at the top, to the left of the search field.

    • Select purchase_date and first_active_month.

  3. Click OK. The dataset should now display two columns only.

    Note

    Recall that the data visible in the recipe is a preview of a sample of your data.

  4. Click Save.

Dataiku screenshot of display filters.

Parse dates#

In this section, we’ll parse the two date columns: purchase_date and first_active_month.

At this point, the storage type of these columns is a string, but their meanings inferred by Dataiku is an unparsed date. Let’s parse them so that we can treat them as a proper date.

  1. Open the purchase_date column header dropdown, and select Parse date.

    It opens the Smart Date dialog, which shows the most likely formats for our dates and what the dates would look like once parsed, with some sample values from the dataset. In our case, the dates appear to be in yyyy-MM-dd format.

  2. With the yyyy-MM-dd format selected, click Use Date Format.

  3. See that a Parse date step has been added to the script.

  4. Repeat steps 1 to 3 on first_active_month, using the yyyy-MM format.

  5. Click Save.

By default, these steps create two new columns named purchase_date_parsed and first_active_month_parsed. Note how both their storage types and meanings are a date.

Dataiku screenshot of a Prepare recipe step parsing dates.

Delete and rename columns#

In the Parse date script step, we could leave the name of the output column empty in order to parse the column in place. Instead of overriding the original date column, let’s delete it and rename the new column.

  1. Click on the purchase_date column header dropdown, and choose Delete. It adds a Remove columns step to the Script panel.

  2. In the Remove columns step, select multiple next to Column and add the first_active_month column.

  3. Click on the purchase_date_parsed column header dropdown, and choose Rename.

  4. Enter the name purchase_date, and click OK.

  5. Repeat steps 3 and 4 to rename first_active_month_parsed into first_active_month.

  6. For the last step in the script, click on the eye icon to re-enable the step preview.

    Note

    Recall that the data visible in the recipe is a preview of what your output dataset will look like based on a sample of data.

  7. Click Save.

Dataiku screenshot of a Prepare recipe step parsing dates.

Extract date components#

Now that the dates are parsed, let’s extract some date components from purchase_date, like the year, month, day and day of the week.

  1. Click on the purchase_date column header dropdown, and choose Extract date components. It adds a step to the Script panel and three columns to the dataset.

  2. In the script, set the following options as below:

    • In the Year column field, enter purchase_year.

    • In the Month column field, enter purchase_month.

    • In the Day column field, enter purchase_day.

    • In the Day of week column field, enter purchase_dow.

  3. Click Save.

Dataiku screenshot of a Prepare recipe step extracting date components.

Important

The values in the purchase_dow (day of the week) column ranges from 1 to 7, where 1 corresponds to Monday.

Compute time difference#

Let’s compute the time from the first active month to the purchase date.

  1. Click on the first_active_month column header dropdown, and choose Compute time since. It adds a step to the Script panel.

  2. In the script, configure the Compute time difference step as follows:

    • Leave the Time since column option to first_active_month.

    • Set the Until option to Another date column.

    • Set the Other column option to purchase_date.

    • Ensure the Output time unit option is set to Days.

    • Enter days_active as the name of the Output column.

  3. Click Save.

Dataiku screenshot of a Compute time difference step in a Prepare recipe.

Show all columns from the dataset#

Now that we’ve finished processing the dates, let’s display all the columns in the dataset.

  1. At the top right of the dataset preview, click on Display > Select displayed columns.

  2. In the Displayed columns dialog, set the Display option to All columns.

  3. Click OK. The dataset should now display all columns.

  4. Click Save.

Dataiku screenshot of a Display menu in a Prepare recipe.

Process numbers#

Convert currencies#

One handy processor provides the ability to convert currencies based on a date column.

This time, we will not add the step by clicking on a column header, but instead search the processors library which includes about 100 data preparation processors.

  1. At the bottom of the Script panel, click + Add a New Step.

  2. Search for Convert currencies.

  3. Enter purchase_amount as the column.

  4. Select USD as the input currency.

  5. Select EUR as the output currency.

  6. For input date source, select From Column (Date), and enter purchase_date as the date column.

  7. Name the output column purchase_amount_eur.

  8. Click Save.

Dataiku screenshot of converting currency in a prepare recipe.

Round numbers#

Let’s also round the original purchase_amount values for simplicity.

  1. Click on the column header for purchase_amount.

  2. Click on Round to integer.

  3. Increase the Decimal places to 2.

  4. Click Save.

Dataiku screenshot of rounding numbers in a Prepare recipe.

Tip

In the Script panel, try to disable/enable this step. Dataiku processes the steps sequentially. So, rounding the amounts after the conversion step does not have any impact on the purchase_amount_eur col.

Dataiku screenshot showing how to disable a step in a Prepare recipe.

Process text#

Simplify text#

If we want to work with a text or natural language column, a good starting point is often to normalize the text. When you do so, Dataiku transforms all text to lowercase, removes punctuation and accents, and performs unicode normalization.

  1. Click on the column header for product_title.

  2. Click Simplify text.

  3. Leave the default option to normalize the text.

  4. Click Save.

Dataiku screenshot of the Simplify text step in a visual analysis.

Extract text with a regular expression#

We can manipulate string data with regular expressions in many places throughout Dataiku, including in data preparation. As an example, let’s use a regular expression to extract into a new column the name of the first match of common Apple products.

  1. At the bottom of the Script panel, click + Add a New Step.

  2. Search for Extract with regular expression.

  3. Enter:

    • product_title as the input column.

    • apple as the prefix for the output column.

    • \b(apple|macbook|ipad|iphone|ipod)\b as the regular expression.

  4. Click Save.

  5. As this regular expression does not impact all rows, to see the impact more easily, let’s filter the column:

    • From the apple1 column header dropdown, select Filter.

    • In the filter box, check all values except No value to exclude rows that are empty.

  6. Once you’ve reviewed the apple1 column, delete the filter by clicking the bin icon next to it.

Dataiku screenshot of regex extraction step in a Prepare recipe.

Note

Here, we’ve provided the regular expression for you, but you can explore how to use the smart pattern builder on your own.

Use a formula#

Finally, let’s use a Formula step to create a binary column named purchase_weekend. Dataiku formulas are a very powerful expression language to perform calculations, manipulate strings, and much more.

  1. At the bottom of the Script panel, click + Add a New Step.

  2. Search for Formula.

  3. In the Script panel, enter purchase_weekend as the name of the new output column.

  4. Click the Open Editor Panel to use the advanced formula editor, which will autocomplete column names and show you a preview of the output.

  5. In the Formula field, enter if(purchase_dow>5,1,0).

  6. Click Apply then Save.

Important

The purchase_weekend column appears in the dataset only after clicking the Apply button.

Dataiku screenshot of a formula in a Prepare recipe.

As you can see, purchase_weekend is set to:

  • 0 (false) when the day of the week stored in the purchase_dow column ranges from 1 to 5, as 1 corresponds to Monday and 5 to Friday.

  • 1 (true) when the day of the week is either 6 or 7 (Saturday and Sunday, respectively).

Group steps together#

Dataiku lets you group steps to organize the Script panel. Here let’s group together the steps used to process the dates.

  1. In the Script panel, select the first six steps (two Parse date steps, Remove, Rename, Extract date and Compute time difference).

  2. On top of the step list, click Actions > Group.

  3. Name the group Process dates.

  4. Click Save.

Dataiku screenshot showing how to group steps in a Prepare recipe.

At any time, you can click on the group to expand all the steps it includes.

Run the recipe#

Once all the steps are defined in your script, it is time to apply the script on the whole dataset and explore the output dataset.

  1. Click Run in the lower-left corner of the page (or use the keyboard shortcut @+r+u+n).

  2. When the job completes, click Explore dataset tx_prepared to view the output dataset. You can also return to the Flow and see your progress.

What’s next?#

Congratulations on completing your first Prepare recipe! This tutorial only shows a small subset of what can be done in a Prepare recipe.

See also

For more information on data preparation, see also:

Now, there’s still more data exploration and preparation to be done using other visual recipes.

Let’s dive deeper by grouping some data on a dataset.