Prepare data¶
Data cleaning and preparation is typically one of the most time-consuming tasks for anyone working with data. Let’s see how this critical step can be accomplished with visual tools in Dataiku.
Note
You can use the Prepare recipe for data cleaning and feature generation in a visual and interactive way.
Parse dates¶
In a Prepare recipe, the inferred meanings for each column allow Dataiku to suggest relevant actions in many cases. Let’s see how this works for a date column.
With the training_data dataset selected, initiate a Prepare recipe from the Actions sidebar on the right.
The name of the output dataset,
training_data_prepared
, fits well so just create the recipe.
Dataiku recognizes the birth column as an unparsed date. The format though is not yet confirmed.
From the birth column header dropdown, select Parse date.
In the “Smart Date” dialog, click Use Date Format to accept the detected format.
Return to the birth column header dropdown, and select Delete.
From the birth_parsed column header dropdown, select Compute time since.
On the left side of the screen, change the “Output time unit” to Years.
Change the name of the “Output column” to
age
.
Tip
You deleted the birth column in the Prepare script above. However, this column remains in the input dataset. It will only be removed from the output dataset. Unlike a spreadsheet tool, this system of input and output datasets makes it easy to track what changes have occurred.
Handle outliers¶
The age column has some suspiciously large values. The same Analyze tool used to explore data can also be used to take certain actions in a Prepare recipe.
From the age column header dropdown, select Analyze.
In the Actions dropdown of the Outliers section, choose Clear rows outside 1.5 IQR to filter out rows with abnormally high and low ages.
Tip
In addition to the suggested actions for each column, you can also click +Add New Step at the bottom of the script on the left to search the processors library for more than 100 different transformations.
Resolve IP addresses¶
Let’s try out a few more processors to further enrich the dataset.
From the ip column header dropdown, select Resolve GeoIP.
Extract only the country, GeoPoint, and continent code.
Delete the original ip column.
Click Save.
Fuzzy join data¶
In the earlier Join recipe, the customer IDs in each input dataset needed to match exactly. In some cases though, the ability to join based on a “close” match is what you really need.
Return to the Flow, and open the country_gdp dataset to remember what it contains.
Country names or abbreviations are one example where fuzzy matching may be a solution. For example, we would be able to match “UK” in one dataset with “U.K.” in another.
From the Flow, double click to open the Prepare recipe.
At the bottom of the recipe script, click to +Add a New Step.
Select Fuzzy join with other dataset.
Add
ip_country
as the “Join column (here)”.Add country_gdp as the “Dataset to join with”.
Add
Country
as the “Join column (in other dataset)”.Add
GDP_cap
as the “Column to retrieve”.Remove the prefix
join_
.Select “Normalize text”.
Tip
This processor works fine for joining this small lookup table, but there is also a more powerful Fuzzy Join recipe that works with larger datasets.
Write a formula¶
You might also want to create new features from those already present. For this purpose, Dataiku has a Formula language, similar to what you would find in a spreadsheet tool.
Note
Dataiku formulas are a powerful expression language available in many places on the platform to perform calculations, manipulate strings, and much more. This language includes:
common mathematical functions, such as round, sum and max;
comparison operators, such as
>
,<
,>=
,<=
;logical operators, such as
AND
andOR
;tests for missing values, such as
isBlank()
orisNULL()
;string operations with functions like
contains()
,length()
, andstartsWith()
;conditional if-then statements.
At the bottom of the recipe script, click to +Add a New Step.
Choose Formula.
Name the “Output column”
high_gdp_cap
.Click Open Editor Panel to open the Formula editor.
Type the formula below, and click Apply when finished.
As this is the last step for now, click Run to execute the Prepare recipe and produce the output dataset.
if(numval("GDP_cap") > 40000, 1, 0)
Tip
This is only a small window into what can be accomplished in a Prepare recipe. See the Processors reference for a complete list.