Safe sums across columns in Dataiku DSS Formulas

People will usually tell you that 1+1=2. In general, you can believe them; however, in some cases in Dataiku DSS, the following may also be true from time to time : 1+1=11.

🤯 🤔 “What in the name of the bird?!

TL;DR

Why might these two numbers be concatenating when they’re not strings? That’s because there are 3 things you need to know about operations in the Formulas editor:

  • Be careful when using + or -: only do it if your data does not contain any null values.

  • Use the sum() function instead: sum([col_A, col_B, col _C, col_D]) ← use those brackets 👌

  • Or use a nested IF: if(isnull(col_A),0,col_A)+if(isnull(col_B),0,col_B) ← tedious, but also effective

  • Or you can replace if(isnull(A),0,A) with coalesce(A,0)

When do 1+1=11 and 1+1=2 both happen?

I was recently playing with a dataset in which I needed to create a sum across columns, and I found something that astonished me with a formula that went A+B+C+D :

../../../_images/safe-sums-1.png

“But all my data is stored as integers! What is the logic behind this add/concatenate behavior?”

Why?

🧐 Let’s see what’s happening:

1+2+3+4=10 ← add 1+2 , add 3, add 4

null+2+3+4=234 ← (null+2), then concatenate 3, concatenate 4

1+null+3+4=134 ← (1+null), then concatenate 3, concatenate 4

1+2+null+4=34 ← add 1+2, then (+ null), then concatenate 4

1+2+3+null=6 ← add 1+2 , add 3, (+null) then…

See the pattern of how addition turns into concatenation? It always happens after a null value. Why? The short answer is, “It’s complicated”.

Switches from Add → Concatenate

Let’s K.I.S.S. the long answer:

Auto-detection for operations occurs by value and not by column.

The detection of the column datatype occurs after all the operations have been completed.

Concatenate always follows null + anything. It doesn’t matter that all of the data in your columns are numbers, or that they’re stored as a number data type, or that logically 1+nothing+3+4 should be 8, not 134.

What happened in the case of the lovely A+B+C+D above is that null values are turned into “”. Whenever the processor sees anything with a + “” it defaults to concatenation and turns the result into a string.

../../../_images/safe-sums-2.png

Another Example

In the following example, everything works according to how we’d expect it to because G is the result of several, nested operations:

../../../_images/safe-sums-3.png

Let’s walk through an example of how the numbers in column E get crunched:

../../../_images/safe-sums-4.png

Tips and tricks for safe sums in Dataiku DSS

Your first option is to get rid of null values. Maybe you replace them with 0 or maybe you do something else; it’ll depend on your data.

There are many cases, however, where you want to keep null values to do other types of analyses (count null, anyone?). If so, do either of these:

  • Use the sum() function instead: sum([A,B,C,D]) ← see those brackets? Use them!

  • Or use a nested IF: if(isnull(A),0,A)+if(isnull(B),0,B) ← tedious, but also effective

  • Or you can replace if(isnull(A),0,A) with coalesce(A,0)