Become a Master of Dataiku DSS Formulas¶
Introduction¶
Dataiku DSS includes a language to write formulas, much like a spreadsheet.
This How-To provides concrete examples of formulas that can help you in addition to the reference documentation.
Formulas can be used in the Prepare recipe:
To create new columns, filter rows, or flag rows
More generally, Formulas can be used to filter rows in many places in DSS:
In the Filtering recipe, to filter rows
In Machine Learning, to define the extracts to use for train and test set
In the Python and Javascript APIs, to obtain partial extracts from the datasets
In the Public API, to obtain partial extracts from the datasets
In the Group, Window, Join and Stack recipes, to perform pre and post filtering
Basic usage¶
Formulas define an expression that is applied row by row.
Assuming that you have a dataset with columns N1 (numeric), N2 (numeric), and S (string), here are a few example formulas:
2 + 2
N1 + N2
min(N1, N2)
# Returns the smaller of N1 and N2replace(S, 'old', 'new')
# Returns the value of S with ‘old’ replaced by ‘new’if (N1 > N2, 'big', 'small')
# Returns big is N1 > N2, small otherwise
We recommend that you start by reading the introduction paragraphs of the reference documentation before continuing this How-To.
The reference guide has information about typing, arrays, objects, …
Fun with arrays and objects¶
DSS lets you manipulate columns containing arrays ([0,1,2]
) or objects ({"firstname": "John", "lastname" :"Smith"}
).
Arrays and objects are represented using JSON notation. The formula language includes a lot of tools to manipulate arrays and objects.
In the following section, we’ll assume we have an input that looks like this:
client_details |
request_details |
stra |
numa |
---|---|---|---|
{“ip_address” : “1.2.3.4”, “visitor_id” : “bdc456ef”} |
{“timings” : [1232, 2034, 2351], “page_data”:
|
[“Big”, “Medium”, “Small”] |
[0, 1, 2, 5] |
Access elements¶
Formulas support accessing array elements and object keys using the traditional Python/Javascript syntax:
`array[0]`
`object["key"]`
`object.key`
(only valid if ‘key’ is a valid identifier, i.e. matches`[A-Za-z0-9_]*`
)
Note
Important notes
Array and object columns are not parsed by default in DSS. They are only automatically parsed in functions that use arrays. However, index and key access are not functions, so we’ll need to use `parseJson`
to convert our columns to objects.
You’ll find more information in the Reference documentation for formulas.
Thus:
`parseJson(request_details)["page_data"]`
: Extract the page_data array`parseJson(request_details)["page_data"][1]`
: Extract the second element of the page_data array`parseJson(request_details)["page_data"][1]["url"]`
: Extract the URL of the second element of the page_data array
Use ‘with’ to simplify an expression¶
The `with`
control lets you bind a variable to an expression, to simplify an expression. For example, if we wanted to sum the first two ranks, we’d need: `parseJson(request_details)["page_data"][0]["rank"] + parseJson(request_details)["page_data"][1]["rank"]`
.
But using `with`
, it simply becomes:
# Alias parseJson(request_details)["page_data"] as "pages"
with(parseJson(request_details)["page_data"], pages,
pages[0]["rank"] + pages[1]["rank"]
)
Filter an array¶
The filter
control allows you to filter array elements based on a predicate:
The syntax is : filter(array a, variable v, expression e) (returns: array)
Evaluates expression a to an array. Then for each array element, binds its value to variable name v, evaluates expression e, and pushes the result onto the result array if the result is truish.
Examples:
filter(stra, v, v.length == 3)
: Returns [“Big”]filter(numa, v, v < 3)
: Returns [0, 1, 2]
Since filter is an array-expecting function, we don’t need to parseJson here.
A bigger example: we want expression that returns the number of pages with rank >= 2.
# Here we need parseJson because we access an item of the object
length(filter(parseJson(request_details)["page_data"], v, v.rank >= 2))
# Using object notation (but not for filter)
filter(parseJson(request_details)["page_data"], v, v.rank >= 2).length()
Apply a function¶
The formula language includes the forEach
and forEachIndex
functions to return a new array after applying a function to each element (like the Javascript “map”).
The syntax is: forEeach(array a, variable v, expression e) (returns: array)
Evaluates expression a to an array. Then for each array element, binds its value to variable name v, evaluates expression e, and pushes the result onto the result array.
An example use case: if you have an array A ["1", "5", "12"]
and want the sum. The sum()
function only accepts numbers. We can use forEach this way:
sum(forEach(A, v, v.toNumber()))
# With object notation. Remember: filter and forEach can't use object notation
forEach(A, v, v.toNumber()).sum()
A complete example¶
Let’s say we want to return the length of the concatenation with ‘–’ of all URLs for which rank is 2
(ie, the length of “/u2–/u3”)
We therefore need to:
Access the page data array
Filter the array
Apply a function to get an array of URLs
Join the array
Get its length
# First way to write, using functions
# We choose to use 'with' to reduce a bit the size of the formula
with(parseJson(request_details)["page_data"], pd,
length( join( forEach( filter(pd, f, f.rank == 2), e, e["url"] ) , "--") )
)
# Using object notation:
# Remember: filter and forEach can't use object notation
with(parseJson(request_details)["page_data"], pd,
forEach(
filter(pd, f, f.rank == 2),
e, e.url
)
).join("--").length()
Dates handling¶
In addition to the vast date handling features of DSS data preparation, formulas add additional date processing facilities.
First of all, a note about typing:
Date-typed columns are not automatically parsed as “date” types in formulas (i.e., they are kept as strings)
However, all functions that expect a date argument will automatically try to convert from the regular Date format in DSS (
2015-01-02T03:04:05.234Z
)
In other words, if your column is already recognized as a valid Date
meaning, you should generally not need to use any specific parsing. If you happen to need one, use the asDate()
function
Other useful formulas¶
Use formula to ‘fill blanks from another column’¶
name |
alt_name |
---|---|
Z |
|
A1 |
|
E2 |
C3 |
To fill the “blanks” of “name” with the value of “alt_name”, create a Formula processor in data preparation.
Set the output column to be “name” (to overwrite the “name” column) and use this formula:
if (isBlank(name), alt_name, name)