Prepare Recipe¶
The Prepare recipe is a visual recipe that deserves special attention due to its extensive library of processors.
Note
The Prepare recipe is the focus of this section, but most of these materials also apply to the visual analysis of the Lab, which can be deployed to the Flow as a Prepare recipe.
Reference | Performing joins in the Prepare recipe¶
Dataiku supports a number of different ways to join datasets.
In terms of non-code options, the Join recipe is a natural choice. However, simple left-joins on relatively small datasets can be performed using a Prepare recipe or visual analysis.
The Prepare recipe can also execute two other powerful kinds of joins: a fuzzy join and a geo-join.

Left-join for small datasets¶
The processors library includes an option to perform a memory-based left-join.
In the Prepare recipe below, we join the columns gender and ip_address from the Customers dataset to the Orders dataset.
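Conceptually, this memory-based left join behaves like a left join in pandas. Below is a minimal sketch with toy data, assuming a hypothetical customer_id join key; it illustrates the behavior, not Dataiku's implementation:
import pandas as pd

# Hypothetical data; 'customer_id' is an assumed join key for illustration.
orders = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "amount": [20.0, 35.5, 12.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "gender": ["F", "M"],
    "ip_address": ["10.0.0.1", "10.0.0.2"],
})

# Left join: every order is kept; orders without a matching
# customer get empty (NaN) values for gender and ip_address.
enriched = orders.merge(
    customers[["customer_id", "gender", "ip_address"]],
    on="customer_id",
    how="left",
)
print(enriched)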

Fuzzy join¶
Note
There is also a dedicated Fuzzy Join recipe. See its documentation or this hands-on tutorial to learn more.
The Fuzzy join processor performs a join where the values of the join key do not need to match precisely, but are “close” according to the Damerau-Levenshtein distance.
For example, you may want to enrich a customer database with information on the economies of the countries where your customers live. Using the country names as the key in a traditional join may be difficult, because the country names in one dataset may not exactly match the country names in the other, either because of typos or differences in use of the official or common name of a country. A fuzzy join can help you to successfully merge this data.
With a Damerau-Levenshtein distance of 1, United States will match the typo Untied States, but a distance of 2 is required to match more garbled entries like Untied Stats, and a greater distance still to match United States of America. The difficulty at that point is that increasing the distance enough to match United States and United States of America can also create a match with United Kingdom, so you may need to recode values.
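To make these thresholds concrete, here is a minimal sketch of the optimal string alignment variant of the Damerau-Levenshtein distance. It is an illustration, not Dataiku's implementation:
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insertions, deletions,
    substitutions, and adjacent transpositions each cost 1."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

print(osa_distance("United States", "Untied States"))            # 1 (one transposition)
print(osa_distance("United States", "Untied Stats"))             # 2
print(osa_distance("United States", "United States of America")) # 11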

In addition to specifying the Damerau-Levenshtein distance, you can also apply string operations like normalizing the text or stemming words before attempting the join.
Geo-join¶
Note
There is also a dedicated Geo join recipe. See the reference documentation or this tutorial to learn more.
The Geo-join processor performs a geographic nearest-neighbor join between two datasets with latitude-longitude coordinates.
For example, you may have determined the locations of customers from their IP addresses. You can enrich this data with information about the closest airport to each customer by using a Geo-join that finds the airport with the shortest distance based on geographic location.
The latitude and longitude of the locations to be joined are used as the keys, and must be available in both datasets. If you have a GeoPoint, you can first apply the Extract lat/lon from GeoPoint processor. You can select which columns from the other dataset should be copied to this one, optionally including the distance between the joined locations.
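For intuition, a nearest-neighbor geo-join amounts to finding the minimum great-circle distance. Here is a rough Python sketch using the haversine formula, with approximate airport coordinates chosen for illustration only; it is not how Dataiku implements the processor:
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

# Approximate coordinates, for illustration only.
airports = [
    {"code": "JFK", "lat": 40.64, "lon": -73.78},
    {"code": "CDG", "lat": 49.01, "lon": 2.55},
]

def nearest_airport(lat, lon):
    """Brute-force nearest neighbor; also returns the joined distance."""
    best = min(airports,
               key=lambda ap: haversine_km(lat, lon, ap["lat"], ap["lon"]))
    return best["code"], round(haversine_km(lat, lon, best["lat"], best["lon"]), 1)

print(nearest_airport(40.71, -74.01))  # a customer in New York -> ('JFK', ...)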

Reference | Using custom Python functions in the Prepare recipe¶
The Prepare recipe holds processors to handle the most common data preparation tasks. For more customized tasks, you can craft operations in the Formula language.
For even more complex tasks, or according to your preference, you can also use the Python function processor.
This processor allows you to perform complex row-wise operations on your dataset. The output of the function depends on which of the three modes is used:
| Mode | Returns |
| --- | --- |
| Cell | A new cell for each row |
| Row | A row for each row |
| Rows | A list of rows for each row |

Cell mode¶
In cell mode, Dataiku expects a single column as output. This is useful for finding the largest value across columns, or the sum of a set of columns whose names follow some pattern.
In the example below, the process() function finds the longest string value across the columns in the dataset.
def process(row):
    # In 'cell' mode, the process function must return
    # a single cell value for each row.
    # The 'row' argument is a dictionary of the columns of the row.
    max_len = -1
    longest_str = None
    for val in row.values():
        if val is None:
            continue
        if len(val) > max_len:
            max_len = len(val)
            longest_str = val
    return longest_str

Row mode¶
In row mode, Dataiku expects a row as output, which replaces the existing row. This is useful when you need to perform operations in place on the rows (as in the image below), or when you need to add multiple columns to the dataset (such as when you compute the mean and standard deviation of values in a row across columns).
In the example below, the process() function adds 5 to the integer-valued columns. Note that the row parameter passed to the process() function is a dictionary of string values, so we use the ast module to evaluate whether they are integer-valued.
import ast

def process(row):
    # In 'row' mode, the process function must return the full row.
    # You may modify the 'row' in place to
    # keep the previous values of the row.
    for i in row.keys():
        try:
            isint = type(ast.literal_eval(row[i])) is int
        except Exception:
            isint = False
        if isint:
            row[i] = int(row[i]) + 5
    return row

Rows mode¶
In rows mode, Dataiku expects one or more rows as output, which replace the existing row. This is especially useful when you need to transform your data from wide to long format.
In the example below, the process() function splits each row into two. Where before a single row contained both a person’s work and home zip code and state, now each row contains either the home or work information, along with a new column that indicates whether it is the home or work information.
# Modify the process function to fit your needs
def process(row):
    # In 'multi rows' mode, the process function
    # must return an iterable list of rows.
    ret = []
    home = {"name": row["name"],
            "type": "home",
            "zip": row["home_zip"],
            "state": row["home_state"]}
    work = {"name": row["name"],
            "type": "work",
            "zip": row["work zipcode"],
            "state": row["work state"]}
    ret.append(home)
    ret.append(work)
    return ret

Reference | Handling decimal notations¶
When preparing data, you often encounter numeric data in a variety of formats from around the world.
This brief tutorial introduces how Dataiku handles conversion of decimal notations into a universally-understood raw format.
Decimal notations¶
Many parts of the world commonly display large and decimal numbers as 1,234,567.89. However, this same number, depending on the country, might be more commonly written as:
1234567.89
1234567,89
1 234 567,89
And many other ways
Since Dataiku needs to help different systems talk to each other, and those systems may not agree on formats, Dataiku only treats “computer-notation” numbers as decimals out of the box.
Thus, both for the float and double storage types, and for the Decimal meaning, Dataiku will only accept the following kinds of notation:
1234567.89
1.23456789E6
-1234.33
…
Note
You might want to re-read our documentation about storage types and meanings.
While Dataiku could recognize more forms, other systems, such as Hive, would not, and that would cause various inconsistencies.
Thus, for example, Dataiku will recognize 1,234,567.89 as a string, not a number.
Normalizing in a Prepare recipe¶
You can use a Prepare script (either in a visual analysis or a recipe) to handle datasets with various kinds of numeric representations. In particular, this is a job for the Convert number formats processor.
Here is a snippet of a dataset in a visual analysis containing decimals formatted in both US and French styles.

For the us_notation column, Dataiku predicts a meaning of “Decimal”, but the first two values are invalid. On the other hand, Dataiku predicts a meaning of “Decimal (comma)” for the fr_notation column. Our goal is for Dataiku to recognize both of these columns as valid decimals.
For the fr_notation column, Dataiku suggests a conversion from the French decimal format to a regular decimal. This step uses the Convert number formats processor to convert the column to a Decimal meaning.

The same processor can fix the us_notation column. Add a new step to the script and find the Convert number formats processor. The input format should be recognized as “English” and the output format set to “Raw”.

Now Dataiku recognizes all values of both output columns with a Decimal meaning, and they can be processed as such by all Dataiku-supported compute engines.
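Outside of Dataiku, the same normalization can be sketched in plain Python. This simplified illustration uses a hypothetical to_raw_decimal helper and handles only the two styles shown here; the Convert number formats processor covers many more formats and edge cases:
def to_raw_decimal(value: str, style: str) -> float:
    """Normalize a formatted number string to a raw float.
    style: 'english' (1,234,567.89) or 'french' (1 234 567,89)."""
    if style == "english":
        return float(value.replace(",", ""))
    if style == "french":
        # Strip regular and non-breaking spaces, then use '.' as the decimal mark.
        return float(value.replace(" ", "").replace("\u00a0", "").replace(",", "."))
    raise ValueError(f"unknown style: {style}")

print(to_raw_decimal("1,234,567.89", "english"))  # 1234567.89
print(to_raw_decimal("1 234 567,89", "french"))   # 1234567.89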
How-to | Handle accounting-style negative numbers¶
In accounting, it is common to represent negative numbers with leading and trailing parentheses. For example, (200) equals -200.
A “Find and replace” step in the Prepare recipe can transform a string column with values like (200) into a numeric column with values like -200.
Find and replace step¶
In a “Find and replace” step:
Choose the column with the accounting-style values.
Replace the open parenthesis with a hyphen (minus sign).
Replace the closed parenthesis with no value.

Note
The storage type of the original column is a string, but the newly-created column is a double.
In this example, only one column is being converted. Changing the Column setting from “single” to “multiple” would allow you to convert multiple columns in the same step.
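If you want to prototype the same logic outside the visual step, a rough Python equivalent might look like the sketch below. The accounting_to_number helper is hypothetical, not a Dataiku API:
import re

def accounting_to_number(value: str) -> float:
    """Convert accounting-style '(200)' to -200.0; pass other numbers through."""
    match = re.fullmatch(r"\((\d+(?:\.\d+)?)\)", value.strip())
    if match:
        return -float(match.group(1))
    return float(value)

print(accounting_to_number("(200)"))   # -200.0
print(accounting_to_number("350.25"))  # 350.25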
How-to | Copy-paste Prepare recipe steps¶
Often you may want to reuse the steps of a Prepare recipe in another location. Of course, it is possible to copy the entire Prepare recipe, as with any visual recipe. However, it is also possible to copy any number of individual Prepare recipe steps without recreating the entire recipe.
First, select the needed steps from the Prepare recipe script and copy them using the Actions menu at the script’s top left.

Here we have copied all ten steps from this Prepare recipe for use elsewhere.¶
Once you have copied the steps, you can paste them into another Prepare recipe in the same project, in a different project, or even into a project on a different Dataiku instance. Execute the paste using a keyboard shortcut or Paste after selection from the script Actions menu.
The video below also walks through this process in detail:
What’s next?¶
See these detailed instructions for copying Prepare recipe steps in the reference documentation.
How-to | Apply Prepare steps to multiple columns¶
In many cases when preparing data, you may want to apply the same operation to multiple columns.
This can be achieved in a Prepare recipe or visual analysis in a few different ways, depending on the type of operation and type of data at hand.
From the columns view¶
In the Columns View, you can select multiple columns and choose an operation from the Actions menu to apply to the selected columns. This creates a new step in the Script, and is essentially a shortcut to manually re-creating the steps individually. It is especially useful for processors that cannot be extended to multiple columns in the step editor.
For example, you can apply this strategy to rename multiple columns at once. You may have several columns that relate to a customer’s home contact information. In order to help differentiate this from their work contact information, you can select the needed columns from the Columns View, choose Rename from the Actions menu, and then add home_ as a prefix to all of the selected columns.
As shown in the video below, the new step in the script is a Rename processor that renames each column individually.
The processors that can be applied to multiple columns in this manner depend upon the data type of columns selected:
Selecting all text columns would suggest applying string transformations.
Selecting all numeric columns could suggest filtering by a numeric range.
Selecting a mixture of different data types might suggest removing invalid rows or clearing invalid cells based on meaning.
From the step editor¶
Another way to apply a step to multiple columns is through the step editor. Many processors have several modes for specifying the column.
For example, you may want to fill empty cells of a column with a fixed value, such as Unknown. The Fill empty cells with fixed value processor offers a number of options for selecting columns.

Single: Specify a single column by name, and it fills empty cells in that column. In the example below, home_zip is the selected column.
Multiple: Specify a list of columns by name, and it fills empty cells in each of those columns. In the example below, home_zip and work zipcode are the selected columns.
Pattern: Specify a regular expression. The processor operates on all columns that match the pattern. In the example below, the pattern .*zip.* selects home_zip and work zipcode.
All: The processor operates on all columns.
The video below demonstrates these differences.
Other examples of processors that can be applied in this way include those that impute missing values, flag or filter rows, and round or convert number formats.
Within a Python function¶
It is also possible to process multiple columns using a Python function. This is especially useful for processors that cannot be extended to multiple columns within the step editor.
In the example below, the process() function adds 5 to the integer-valued columns.
import ast

def process(row):
    # In 'row' mode, the process function must return the full row.
    # You may modify the 'row' in place to
    # keep the previous values of the row.
    for i in row.keys():
        try:
            isint = type(ast.literal_eval(row[i])) is int
        except Exception:
            isint = False
        if isint:
            row[i] = int(row[i]) + 5
    return row

How-to | Standardize text fields using fuzzy values clustering¶
When working with large amounts of disparate, user-entered text data, we often need to standardize or collapse entries into a resolved form.
For example, how can we get a computer to recognize that strings like “Abraham Lincoln”, “Abe Lincoln”, and “Abrahm Lincoln” are actually the same category? We want to map these close variations into a smaller number of clean labels.
In the image below, the emp_title column has thousands of distinct values and a long tail of unique values (those that appear only once).

Text cleaning strategies like normalization, which standardize case and remove punctuation, may help in this regard, but they don’t get us all of the way there. Similarly, with free-form user-entered data, a rule-based strategy using regular expressions may not be feasible.
Fuzzy matching¶
The answer lies in a technique called fuzzy matching.
When viewing a dataset in the Explore tab, select Analyze from the context menu of the column header to review the number of distinct and unique values and their distribution.

If you switch over from the Categorical tab to the Values Clustering tab, you have the option to search for clusters.
You can choose a clustering strategy of “Fuzzy” or “Highly fuzzy” depending on how close you require values in the cluster to be. In this case, we’ll choose “Fuzzy”.
You can also compare words or characters. For this use case, we’ll compare words.
If you repeat these steps in a Prepare recipe, you can take action on these clusters.
You can adjust the final, standardized label if needed. Then, you can click “Merge Selected” to perform a merge on all or a selection of clusters so that the original values are mapped to the new cluster label.

To take the first example, all five of these variations will now have the same label: “Dept of Homeland Security”.
This merge action creates a fully transparent and editable Find and replace step in the Prepare recipe with hundreds or even thousands of individual replacements.
Super simple and a huge time saver!
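Under the hood, the idea resembles grouping values by string similarity. Here is a rough sketch of the concept using Python's standard difflib; the similarity measure and cutoff are assumptions for illustration, not Dataiku's clustering algorithm:
import difflib

titles = ["Dept of Homeland Security", "dept of homeland security",
          "Dept. of Homeland Security", "Teacher", "teacher"]

clusters = {}
for title in titles:
    # Attach each value to an existing cluster label if one is close enough.
    match = difflib.get_close_matches(title.lower(), list(clusters), n=1, cutoff=0.85)
    if match:
        clusters[match[0]].append(title)
    else:
        clusters[title.lower()] = [title]

for label, members in clusters.items():
    print(label, "->", members)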
How-to | Reshape data from wide to long format¶
You can use the Pivot recipe to reshape data from long to wide format. However, if initially presented with data in a wide format, you can “unpivot” the data from wide to long format using the Prepare recipe processor Fold multiple columns (or Fold multiple columns by pattern).
Consider a dataset with the following structure:

To reshape this dataset so that the *_total_sum columns are folded into one total_sum column with one row per year, follow these steps (a pandas equivalent of the reshape is sketched after the list):
In a Prepare recipe, click + Add a New Step.
Choose Fold multiple columns by pattern.
For the field Columns to fold pattern, supply a regular expression that matches which columns should be folded.
For the Column for fold name field, provide a name for the new column holding the row labels (in this case year).
For the Column for fold value field, provide a name for the new column holding the cell values (in this case total_sum).
Check the box Remove folded columns to delete the folded columns from the schema of the output dataset.
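For comparison, here is the same wide-to-long reshape in pandas. Column names like 2019_total_sum are assumptions matching the *_total_sum pattern above:
import pandas as pd

# Hypothetical wide-format data for illustration.
wide = pd.DataFrame({
    "store": ["A", "B"],
    "2019_total_sum": [100, 250],
    "2020_total_sum": [120, 300],
})

long = wide.melt(
    id_vars=["store"],
    value_vars=["2019_total_sum", "2020_total_sum"],
    var_name="year",
    value_name="total_sum",
)
# Strip the '_total_sum' suffix so 'year' holds just the year label.
long["year"] = long["year"].str.replace("_total_sum", "", regex=False)
print(long)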

Note
You can find another example of this processor being used in the reference documentation.