Concept | Dataiku formulas#

Watch the video

Background#

Similar to what you might find in a spreadsheet tool like Excel, Dataiku has its own Formula language. It’s a powerful expression language to perform calculations, manipulate strings, and much more. You can use Formulas throughout Dataiku, adding power and flexibility to visual recipes.

A slide showing a few examples of various Dataiku formulas.

Formula processor#

One way to leverage formulas is to use the Formula processor in the Prepare recipe. From the processor library, you can add a Formula step and provide the name of the output column.

You could write simple formulas directly in the Expression box. However, opening the editor panel adds a few support measures. The first is code completion. As soon as you start typing, Dataiku starts suggesting columns from the dataset or functions to apply. The editor will also alert you if the formula is invalid.

A screenshot of the formula editor in the Prepare recipe.

Common operations#

The Formula language allows you to craft expressions of considerable complexity. For example, you can use:

  • Common mathematical functions, such as round, sum and max

  • Comparison operators, such as >, <, >=, <=

  • Logical operators, such as AND and OR

  • Tests for missing values, such as isBlank() or isNULL()

  • Conditional if-then statements

Common functions#

Here are some examples of useful Dataiku formula functions:

Function

Description

length()

Returns the length of a string. This is useful when you want to filter rows based on the length of text data.

substring()

Extracts a substring from a text column by indicating the starting and ending position of the part that you want to extract.

rand()

Generates random integers between the minimum and maximum values you choose. This is useful for generating simulation data.

strval()

Allows for accessing the values of columns, where the column name includes a space. Without the proper syntax, Dataiku displays a syntax error. To avoid this error, wrap the column name in double quotations, and apply the strval() function. The value is returned as a string.

numval()

Evaluates the column name with spaces and returns the column value as a number.

Use case: Flag large credit card purchases#

Let’s say you have a credit card transaction dataset where you want to compare the amount of a given transaction to the average purchase amount for the credit card.

This could be useful for fraud detection. For example, if a card holder makes a disproportionately expensive purchase compared to the regular use of their credit card, you could flag the purchase as potentially fraudulent.

If you added a Formula step to create a new column named higher_than_avg_purchase, you could open the formula editor and write an expression to check whether the value of the purchase_amount column is higher than that of the card_purchase_amount_average column. You could use the following formula to do so:

  • if(purchase_amount > card_purchase_amount_avg, "yes", "no")

A Dataiku screenshot of the if statement formula expression.

In this case, each transaction would be flagged as being more expensive than the card average or not.

Next steps#

You just learned how to use Dataiku’s spreadsheet-like formula language to perform calculations, manipulate strings, and much more. Next, try using the formula language in other visual recipes in Dataiku!

See also

See the Formula language page in the reference documentation.

You can continue getting to know the basics of Dataiku by following the Tutorial | Prepare recipe.

Tip

You can find this content (and more) by registering for the Dataiku Academy course, Visual Recipes. When ready, challenge yourself to earn a certification!