Prepare Data

Data cleaning and preparation is typically one of the most time-consuming tasks for anyone working with data. Let’s see how this critical step can be accomplished with visual tools in Dataiku DSS.

Note

You can use the Prepare recipe for data cleaning and feature generation in a visual and interactive way.

Parse Dates

In a Prepare recipe, the inferred meanings for each column allow Dataiku DSS to suggest relevant actions in many cases. Let’s see how this works for a date column.

  • With the training_data dataset selected, initiate a Prepare recipe from the Actions bar on the right.

  • The name of the output dataset, training_data_prepared, fits well so just create the recipe.

Dataiku DSS recognizes the birth column as an unparsed date. The format though is not yet confirmed.

  • From the birth column header dropdown, select Parse date.

    • In the “Smart date” dialog, click OK to accept the detected format (or Use Date Format if using Dataiku Online).

  • Return to the birth column header dropdown, and select Delete.

  • From the birth_parsed column header dropdown, select Compute time since.

    • Change the “Output time unit” to Years.

    • Change the name of the “Output column” to age.

Tip

You deleted the birth column in the Prepare script above. However, this column remains in the input dataset. It will only be removed from the output dataset. Unlike a spreadsheet tool, this system of input and output datasets makes it easy to track what changes have occurred.

Handle Outliers

The age column has some suspiciously large values. The same Analyze tool used to explore data can also be used to take certain actions in a Prepare recipe.

  • From the age column header dropdown, select Analyze.

  • In the Outliers section, choose Clear rows outside 1.5 IQR from the Actions menu.

Tip

In addition to the suggested actions for each column, you can also directly search the processors library for more than 90 different transformations commonly needed in data wrangling by clicking +Add New Step at the bottom of the script.

Resolve IP Addresses

Let’s try out a few more processors to further enrich the dataset.

  • From the ip column header dropdown, select Resolve GeoIP.

  • Extract only the country, GeoPoint, and continent code.

  • Delete the original ip column.

  • Click Save and Update Schema.

Fuzzy Join Data

In the earlier Join recipe, the customer IDs in each input dataset needed to match exactly. In some cases though, the ability to join based on a “close” match is what you really need.

  • Return to the Flow, and open the country_gdp dataset to remember what it contains.

Country names or abbreviations are one example where fuzzy matching may be a solution. For example, we would be able to match “UK” in one dataset with “U.K.” in another.

  • From the Flow, double click to open the Prepare recipe.

  • At the bottom of the recipe script, click to +Add a New Step.

  • Search for fuzzy.

  • Select Fuzzy join with other dataset.

    • Add ip_country as the “Join column (here)”.

    • Add country_gdp as the “Dataset to join with”.

    • Add Country as the “Join column (in other dataset)”.

    • Add GDP_cap as the “Column to retrieve”.

    • Remove the prefix join_.

    • Select “Normalize text”.

Tip

As of version 9.0, there is also a more powerful Fuzzy Join recipe that works with larger datasets.

Write a Formula

You might also want to create new features from those already present. For this purpose, Dataiku DSS has a Formula language, similar to what you would find in a spreadsheet tool.

Note

Dataiku DSS formulas are a powerful expression language available in many places on the platform to perform calculations, manipulate strings, and much more. This language includes:

  • common mathematical functions, such as round, sum and max;

  • comparison operators, such as >, <, >=, <=;

  • logical operators, such as AND and OR;

  • tests for missing values, such as isBlank() or isNULL();

  • string operations with functions like contains(), length(), and startsWith();

  • conditional if-then statements.

  • At the bottom of the same recipe script, click to +Add a New Step.

  • Choose Formula.

  • Name the “Output column” high_gdp_cap.

  • Click Open Editor Panel to open the Formula editor.

  • Type the formula below, and click Apply when finished.

  • As this is the last step for now, click Run and Update Schema to execute the Prepare recipe and produce the output dataset.

if(numval("GDP_cap") > 40000, 1, 0)

Tip

This is only a small window into what can be accomplished in a Prepare recipe. See the product documentation for a complete view.