Tutorial | Relative referencing in Dataiku formulas#
The ability to reference previous rows in a formula is beneficial for tasks such as computing growth rates, generating moving averages, and calculating changes between consecutive records.
When using a formula in the Prepare recipe, you can accomplish this using an optional offset argument.
Get started#
Objectives#
In this tutorial, you will:
Learn how to use an offset argument for various formula functions.
Apply these functions to calculate cumulative sums and compound inflation.
Prerequisites#
Dataiku 13.1 or later.
Create the project#
From the Dataiku Design homepage, click + New Project.
Select Learning projects.
Search for and select Relative Referencing Formulas.
Click Install.
From the project homepage, click Go to Flow (or
g
+f
).
From the Dataiku Design homepage, click + New Project.
Select DSS tutorials.
Filter by Advanced Designer.
Select Relative Referencing Formulas.
From the project homepage, click Go to Flow (or
g
+f
).
Note
You can also download the starter project from this website and import it as a zip file.
Offset syntax#
The optional offset argument is available for value access functions.
val("column", defaultValue, offset)
strval("column", defaultValue, offset)
numval("column", offset)
The offset must be a positive integer. It can only reference previous rows.
The number 1
references the row immediately before the current row. The number 2
represents two rows before the current row, and so on.
Enforce row order#
By default, Dataiku does not enforce row order. However, there are several methods available to ensure that your dataset has a set sequence.
Enable the Preserve ordering option in the Settings of a dataset. This is helpful when an uploaded dataset already has the correct row order.
Use the Sort recipe. This automatically enables the Preserve ordering option in the output dataset.
Sort the data externally in the database connection.
Note
Be aware that Dataiku cannot preserve the order of external datasets. For example, the Sort recipe will not define the order of a dataset in an SQL database.
However, you can request the data in a certain order using the ORDER BY
query. Sorting methods will vary for different connections.
Find the cumulative sum#
Assume that you want to know the cumulative sum of a customer’s purchases after each purchase.
You could do this in Excel using the following formula:
You could also use Dataiku’s formula processor in a Prepare recipe to accomplish the same task. Let’s find out how!
Sort the data#
First, we have to sort the data to ensure that a customer’s purchases appear consecutively. In this case, we’ll sort in ascending order.
Select the Sales_Data dataset.
From the Actions panel, create a Sort recipe.
Keep the defaults for the output dataset and click Create Recipe.
From the Select columns for sorting dropdown, select CustomerID.
Because we also want to sort invoices by date per customer, add the InvoiceDate column for sorting as well.
Run the recipe.
Write the formula#
Now that we have set the order of the dataset, let’s add a formula to calculate the cumulative sum of each customer’s purchases.
Select the Sales_Data_sorted dataset.
Create a Prepare recipe.
Rename the output dataset to
Sales_Data_prepared
, and then Create Recipe.Click + Add A New Step.
Choose the Formula processor.
Name the output column
Cumulative
.Open the Editor Panel and paste the following formula:
if (val("CustomerID")==val("CustomerID","",1), numval("PurchaseAmount")+numval("Cumulative",1), numval("PurchaseAmount"))
Review the sample output and Apply these changes.
Understand the output#
Let’s review the formula example and relate it to the output.
val("CustomerID")==val("CustomerID","",1)
This statement checks whether the customer ID of the current row matches the customer ID of the previous row. The empty quotation marks in
val("customerid","",1)
represent the default value, in this case an empty string. The number1
is the offset argument and indicates that the function should return the value of the previous row.numval("PurchaseAmount")+numval("Cumulative",1)
If the current customer ID matches the previous customer ID, then the previous cumulative total is added to the current purchase amount.
numval("PurchaseAmount")
If the current customer ID does not match the previous customer ID, then the current purchase amount equals the cumulative total for the row.
You’ll see that the result is the cumulative sum for each customer’s purchases:
Note
You can also use the Window recipe to find cumulative sums.
Format decimal places#
Assume that you only want values to the hundredths place in the Cumulative column, as they represent dollar values. You can format the values using the Dataiku formula format() function.
Add another formula processor step.
In the Output column field, type
Cumulative
.Paste the following formula into the Expression box:
format("%.2f", Cumulative)
Run the recipe.
Calculate compound inflation#
Because inflation compounds and changes over time, we need a way to calculate the current monetary values based on previous values.
We could handle this in Excel like so:
Now, let’s see how to do so in Dataiku!
Preserve row order#
This dataset is already in the correct order. Therefore, we should enable the Preserve ordering setting in the dataset.
Open the US_Inflation dataset.
Switch to the Settings tab.
In the Order section, select the Preserve ordering checkbox.
Save your changes.
Write the formula#
Now, we’ll add a formula to calculate the compound interest of savings over time.
Return to the Flow, and select the US_Inflation dataset.
Click on the Prepare recipe.
Keep the default settings, and then Create Recipe.
Click + Add A New Step.
Choose the Formula processor.
Name the output column
Value
.Open the Editor Panel and paste the following formula:
val("Value", 50, 1) * ( 1 + (val("Inflation Rate", 0, 1)/100) )
Apply the formula.
Run the recipe and review the output dataset.
Understand the output#
Once again, we’ll review the formula example and relate it to the output.
val("Value", 50, 1)
This function retrieves the previous value from the Value column relative to the current row. The number
50
is the default value for this function. We need this to successfully calculate the first row’s value, as it has no previous rows. In this case,50
equals the starting value of which we want to calculate inflation. You can change this to any value you want!( 1 + (val("Inflation Rate", 0, 1)/100) )
This phrase represents the growth factor per year. Again, we need a default value to populate the first row of the column. We can use
0
here, such that the final equation for the first row is:
In the output dataset, you can see what value 50 dollars in 1994 had in later years.
What’s next?#
To learn more about formula functions in Dataiku, see the Formulas page in the reference documentation.