Prepare data

Data cleaning and preparation is typically one of the most time-consuming tasks for anyone working with data. Let’s see how this critical step can be accomplished with visual tools in Dataiku.


You can use the Prepare recipe for data cleaning and feature generation in a visual and interactive way.

Parse dates

In a Prepare recipe, the inferred meanings for each column allow Dataiku to suggest relevant actions in many cases. Let’s see how this works for a date column.

  • With the training_data dataset selected, initiate a Prepare recipe from the Actions sidebar on the right.

  • The name of the output dataset, training_data_prepared, fits well so just create the recipe.

Dataiku recognizes the birth column as an unparsed date. The format though is not yet confirmed.

  • From the birth column header dropdown, select Parse date.

  • In the “Smart Date” dialog, click Use Date Format to accept the detected format.

  • Return to the birth column header dropdown, and select Delete.

  • From the birth_parsed column header dropdown, select Compute time since.

    • On the left side of the screen, change the “Output time unit” to Years.

    • Change the name of the “Output column” to age.


You deleted the birth column in the Prepare script above. However, this column remains in the input dataset. It will only be removed from the output dataset. Unlike a spreadsheet tool, this system of input and output datasets makes it easy to track what changes have occurred.

Handle outliers

The age column has some suspiciously large values. The same Analyze tool used to explore data can also be used to take certain actions in a Prepare recipe.

  • From the age column header dropdown, select Analyze.

  • In the Actions dropdown of the Outliers section, choose Clear rows outside 1.5 IQR to filter out rows with abnormally high and low ages.


In addition to the suggested actions for each column, you can also click +Add New Step at the bottom of the script on the left to search the processors library for more than 100 different transformations.

Resolve IP addresses

Let’s try out a few more processors to further enrich the dataset.

  • From the ip column header dropdown, select Resolve GeoIP.

  • Extract only the country, GeoPoint, and continent code.

  • Delete the original ip column.

  • Click Save.

Fuzzy join data

In the earlier Join recipe, the customer IDs in each input dataset needed to match exactly. In some cases though, the ability to join based on a “close” match is what you really need.

  • Return to the Flow, and open the country_gdp dataset to remember what it contains.

Country names or abbreviations are one example where fuzzy matching may be a solution. For example, we would be able to match “UK” in one dataset with “U.K.” in another.

  • From the Flow, double click to open the Prepare recipe.

  • At the bottom of the recipe script, click to +Add a New Step.

  • Select Fuzzy join with other dataset.

    • Add ip_country as the “Join column (here)”.

    • Add country_gdp as the “Dataset to join with”.

    • Add Country as the “Join column (in other dataset)”.

    • Add GDP_cap as the “Column to retrieve”.

    • Remove the prefix join_.

    • Select “Normalize text”.


This processor works fine for joining this small lookup table, but there is also a more powerful Fuzzy Join recipe that works with larger datasets.

Write a formula

You might also want to create new features from those already present. For this purpose, Dataiku has a Formula language, similar to what you would find in a spreadsheet tool.


Dataiku formulas are a powerful expression language available in many places on the platform to perform calculations, manipulate strings, and much more. This language includes:

  • common mathematical functions, such as round, sum and max;

  • comparison operators, such as >, <, >=, <=;

  • logical operators, such as AND and OR;

  • tests for missing values, such as isBlank() or isNULL();

  • string operations with functions like contains(), length(), and startsWith();

  • conditional if-then statements.

  • At the bottom of the recipe script, click to +Add a New Step.

  • Choose Formula.

  • Name the “Output column” high_gdp_cap.

  • Click Open Editor Panel to open the Formula editor.

  • Type the formula below, and click Apply when finished.

  • As this is the last step for now, click Run to execute the Prepare recipe and produce the output dataset.

if(numval("GDP_cap") > 40000, 1, 0)


This is only a small window into what can be accomplished in a Prepare recipe. See the Processors reference for a complete list.