Concept | Dataiku formulas#

Watch the video

Background#

Similar to what you might find in a spreadsheet tool like Excel, Dataiku has its own Formula language. It is a powerful expression language to perform calculations, manipulate strings, and much more. Formulas can be used throughout Dataiku and add power and flexibility to visual recipes.

A slide showing a few examples of various Dataiku formulas.

Formula processor#

One way to leverage formulas is to use the Formula processor in the Prepare recipe. From the processor library, you can add a Formula step and provide the name of the output column.

You could write simple formulas directly in the Expression box. However, opening the editor panel adds a few support measures. The first is code completion. As soon as you start typing, Dataiku starts suggesting columns from the dataset or functions to apply. The editor will also alert you if the formula is invalid.

A screenshot of the formula editor in the Prepare recipe.

Common operations#

The Formula language allows you to craft expressions of considerable complexity. For example, you can use:

  • Common mathematical functions, such as round, sum and max

  • Comparison operators, such as >, <, >=, <=

  • Logical operators, such as AND and OR

  • Tests for missing values, such as isBlank() or isNULL()

  • Conditional if-then statements

Common functions#

Here are some examples of useful Dataiku formula functions:

Function

Description

length()

Returns the length of a string. This is useful when we want to filter rows based on the length of our text data.

substring()

Extracts a substring from a text column by indicating the starting and ending position of the part that we want to extract.

rand()

Generates random integers between the minimum and maximum values we choose. This is useful for generating simulation data.

strval()

Allows us to access the values of columns, where the column name includes a space. Without the proper syntax, Dataiku displays a syntax error. To avoid this error, we wrap the column name in double quotations, and apply the “strval()” function. The value is returned as a string.

numval()

Evaluates the column name with spaces and returns the column value as a number.

Use case: Flag large credit card purchases#

Let’s say we have a credit card transaction dataset where we want to compare the amount of a given transaction to the average purchase amount for the credit card it was made with.

This could be useful for fraud detection. For example, if a card holder makes a disproportionately expensive purchase compared to the regular use of their credit card, we could flag the purchase as potentially fraudulent.

If we added a Formula step to create a new column named higher_than_avg_purchase, we could open the formula editor and write an expression to check whether the value of the purchase_amount column is higher than that of the card_purchase_amount_average column. We could use the formula to do so:

  • if(purchase_amount > card_purchase_amount_avg, "yes", "no")

A Dataiku screenshot of the if statement formula expression.

In this case, each transaction would be flagged as being more expensive than the card average or not.

What’s next?#

You just learned how to use Dataiku’s spreadsheet-like formula language to perform calculations, manipulate strings, and much more. Next, try using the formula language in other visual recipes in Dataiku!

See also

See the Formula language page in the reference documentation.

You can continue getting to know the basics of Dataiku by following the Tutorial | Prepare recipe.