Tutorial | Prepare recipe#

The Prepare recipe allows you to define a series of steps, or actions, to take on the dataset.

Get started#

Objectives#

In this tutorial, you will:

  • Create a Prepare recipe.

  • Add various processor steps in different ways.

  • Process dates, numbers and text.

  • Use the Formula language.

Prerequisites#

To complete this tutorial, you’ll need the following:

  • Dataiku 14.0 or later.

Create the project#

  1. From the Dataiku Design homepage, click + New Project.

  2. Select Learning projects.

  3. Search for and select Prepare Recipe.

  4. If needed, change the folder into which the project will be installed, and click Create.

  5. From the project homepage, click Go to Flow (or type g + f).

Note

You can also download the starter project from this website and import it as a ZIP file.

You’ll next want to build the Flow.

  1. Click Flow Actions at the bottom right of the Flow.

  2. Click Build all.

  3. Keep the default settings and click Build.

Use case summary#

Let’s say we’re a financial company that uses some credit card data to detect fraudulent transactions.

The project comes with three datasets, described in the table below.

Dataset

Description

tx

Each row is a unique credit card transaction with information such as the card that was used and the merchant where the transaction was made.

It also indicates whether the transaction has either been:

  • Authorized (a score of 1 in the authorized_flag column)

  • Flagged for potential fraud (a score of 0)

merchants

Each row is a unique merchant with information such as the merchant’s location and category.

cards

Each row is a unique credit card ID with information such as the card’s activation month or the cardholder’s FICO score (a common measure of creditworthiness in the US).

Create a Prepare recipe#

Let’s get started preparing the dataset!

  1. From the Flow, select the tx_joined dataset, and click the Actions (Actions icon.) icon from the right panel.

  2. Under the Visual Recipes section, click on Prepare.

    The New data preparation recipe window opens.

  3. Keep tx_joined as the input dataset.

  4. Name the output dataset tx_prepared.

  5. Click Create Recipe.

Dataiku screenshot of the dialog for a new dataset preparation recipe.

Add steps to the recipe#

Now that you have a Prepare recipe, it’s time to add some steps to it.

Process dates#

Let’s start with dates.

Only display date columns#

This step is optional, but it shows you how to display only the columns you select.

Here, the aim is to focus your attention on the date columns, which are purchase_date and first_active_month. To do so:

  1. At the top right of the dataset preview, click on Display > Select displayed columns.

  2. In the Displayed columns dialog:

    • Set the Display option to Selected columns.

    • Deselect all columns using the checkbox at the top, to the left of the search field.

    • Select purchase_date and first_active_month.

  3. Click OK. The dataset should now display two columns only.

    Note

    Recall that the data visible in the recipe is a preview of a sample of your data.

Dataiku screenshot of display filters.

Parse dates#

The dataset includes two columns with date information, but their storage type and meaning show important differences.

Column

Storage Type

Meaning

Value

purchase_date

date only

Date only

2017-08-24

first_active_month

string

Date (unparsed)

2016-02

Unlike the purchase_date column, first_active_month might look like it contains dates, but it’s actually holding strings. Therefore, you can’t use it for date operations. You’ll need to turn it from a string into a true date.

  1. Open the first_active_month column header dropdown.

  2. Select Parse date.

  3. With the yyyy-MM format selected, click Use Date Format.

  4. In the preview of the output column first_active_month_parsed, note the change in the column’s storage type, meaning, and values.

Dataiku screenshot of a Prepare recipe step parsing dates.

Delete and rename columns#

In the script’s Parse date step, you could leave the name of the output column empty to parse the column in place. However, instead of overriding the original date column, delete it, and rename the new column.

  1. Click on the first_active_month column header dropdown, and choose Delete. It adds a Remove columns step to the Script panel.

  2. Click on the first_active_month_parsed column header dropdown, and choose Rename.

  3. Enter the name first_active_month, and click OK.

  4. For the last step in the script, click on the eye (View impact icon.) icon to re-enable the step preview.

Dataiku screenshot of a Prepare recipe step parsing dates.

Extract date components#

The correct storage of date columns enables usage of several processors built to work on dates.

  1. Open the purchase_date column header dropdown, and choose Extract date components.

  2. In the script, name the output columns as:

    • In the Year column field, enter purchase_year.

    • In the Month column field, enter purchase_month.

    • In the Day column field, enter purchase_day.

    • In the Day of week column field, enter purchase_dow.

Dataiku screenshot of a Prepare recipe step extracting date components.

Important

The values in the purchase_dow (day of the week) column ranges from 1 to 7, where 1 corresponds to Monday.

Compute time difference#

Now that first_active_month is a true date, you can use it in date calculations. Compute the time from the card’s activation month to the purchase date.

  1. Click on the first_active_month column header dropdown, and choose Compute time since.

  2. In the script, configure the Compute time difference step as follows:

    • Leave the Time since column option to first_active_month.

    • Set the until option to Another date column.

    • Set the Other column option to purchase_date.

    • Ensure the Output time unit option is Days.

    • Enter days_active as the name of the Output column.

Dataiku screenshot of a Compute time difference step in a Prepare recipe.

Show all columns from the dataset#

Now that you’ve finished processing the dates, display all the columns in the dataset.

  1. At the top right of the dataset preview, click on Display > Select displayed columns.

  2. In the Displayed columns dialog, set the Display option to All columns.

  3. Click OK. The dataset should now display all columns.

Dataiku screenshot of a Display menu in a Prepare recipe.

Process numbers#

Convert currencies#

One handy processor provides the ability to convert currencies based on a date column.

This time, you won’t add the step by clicking on a column header. Instead, you’ll search the processors library which includes about 100 data preparation processors.

  1. At the bottom of the Script panel, click + Add a New Step.

  2. Search for Convert currencies.

  3. Enter purchase_amount as the column.

  4. Select USD as the input currency.

  5. Select EUR as the output currency.

  6. For input date source, select From Column (Date), and enter purchase_date as the date column.

  7. Name the output column purchase_amount_eur.

Dataiku screenshot of converting currency in a Prepare recipe.

Round numbers#

Let’s also round the original purchase_amount values for simplicity.

  1. Click on the column header for purchase_amount.

  2. Click on Round to integer.

  3. Increase the Decimal places to 2.

Dataiku screenshot of rounding numbers in a Prepare recipe.

Tip

In the Script panel, try to disable/enable this step with the power (Disable step icon.) icon. Dataiku processes the steps sequentially. So, rounding the amounts after the conversion step doesn’t have any impact on the purchase_amount_eur col.

Process text#

Simplify text#

When working with a text or natural language column, a good starting point is often to normalize the text. When you do so, Dataiku transforms all text to lowercase, removes punctuation and accents, and performs Unicode normalization.

  1. Click on the column header for product_title.

  2. Click Simplify text.

  3. Leave the default option to normalize the text.

Dataiku screenshot of the Simplify text step in a visual analysis.

Extract text with a regular expression#

You can manipulate string data with regular expressions in many places throughout Dataiku, including in data preparation. As an example, let’s use a regular expression to extract into a new column the name of the first match of common Apple products.

  1. At the bottom of the Script panel, click + Add a New Step, and search for Extract with regular expression.

  2. In the step configuration, enter:

    • product_title as the input column.

    • apple as the prefix for the output column.

    • \b(apple|macbook|ipad|iphone|ipod)\b as the regular expression.

  3. As this regular expression doesn’t impact all rows, to see the impact, filter the column:

    • From the apple1 column header dropdown, select Filter.

    • In the filter box, check all values except No value to exclude rows that are empty.

  4. Once you’ve reviewed the apple1 column, delete the filter by clicking the bin (Numerical variable icon.) icon next to it.

Dataiku screenshot of regex extraction step in a Prepare recipe.

See also

Here, we’ve provided the regular expression for you, but you can explore how to use the smart pattern builder on your own.

Use a formula#

Finally, let’s use a Formula step to create a binary column named purchase_weekend. Dataiku formulas are a powerful expression language to perform calculations, manipulate strings, and much more.

  1. At the bottom of the Script panel, click + Add a New Step.

  2. Search for Formula.

  3. In the formula editor panel, enter:

    • purchase_weekend in the Formula for field.

    • if(purchase_dow>5,1,0) in the Formula field. Note that the editor autocompletes column names and shows you a preview of the output.

  4. Click Apply.

Dataiku screenshot of a formula in a Prepare recipe.

Tip

As you can see, purchase_weekend is:

  • 0 (false) when the day of the week stored in the purchase_dow column ranges from 1 to 5, as 1 corresponds to Monday and 5 to Friday.

  • 1 (true) when the day of the week is either 6 or 7 (Saturday and Sunday, respectively).

Group steps together#

Dataiku lets you group steps to organize the Script panel. Here, group together the steps used to process the dates.

  1. In the Script panel, select the first five steps working on the date columns.

  2. On top of the step list, click Actions > Group.

  3. Click the pencil (Pencil icon.) icon to name the group Process dates.

Dataiku screenshot showing how to group steps in a Prepare recipe.

Run the recipe#

Once you’ve defined all the steps in the script, it’s time to apply the script on the whole dataset and explore the output dataset.

  1. Click Run in the lower-left corner of the page (or use the keyboard shortcut @+r+u+n).

  2. When the job completes, click Explore dataset tx_prepared to view the output dataset. You can also return to the Flow and see your progress.

Next steps#

Congratulations on completing your first Prepare recipe! This tutorial only shows a small subset of what you can do in a Prepare recipe.

See also

For more information on data preparation, see also:

Now, you can still do more data exploration and preparation using other visual recipes.

Let’s dive deeper by grouping some data on a dataset.