Concept | Safe sums across columns in Dataiku formulas#

People will usually tell you that 1+1=2. In general, you can believe them; however, in some cases in Dataiku, the following may also be true from time to time : 1+1=11. What’s happening here? When are numbers concatenated rather than summed?

When do 1+1=11 and 1+1=2 both happen?#

Let’s use this example:

A

B

C

D

= A + B + C + D

1

2

3

4

10

null

2

3

4

234

1

null

3

4

134

1

2

null

4

34

1

2

3

null

6

“But all my data is stored as integers! What is the logic behind this add/concatenate behavior?”

Logic:

  • 1+2+3+4=10 ← add 1+2 , add 3, add 4

  • null+2+3+4=234 ← (null+2), then concatenate 3, concatenate 4

  • 1+null+3+4=134 ← (1+null), then concatenate 3, concatenate 4

  • 1+2+null+4=34 ← add 1+2, then (+ null), then concatenate 4

  • 1+2+3+null=6 ← add 1+2 , add 3, (+null) then…

See the pattern of how addition turns into concatenation? It always happens after a null value. Why? The short answer is, “It’s complicated.”

Switching from Add → Concatenate#

Here’s the long answer:

Auto-detection for operations occurs by value and not by column.

The detection of the column datatype occurs after all the operations have been completed.

Concatenate always follows null + anything. It doesn’t matter that all of the data in your columns are numbers, or that they’re stored as a number data type, or that logically 1+nothing+3+4 should be 8, not 134.

What happened in the case of the lovely A+B+C+D above is that null values are turned into “”. Whenever the processor sees anything with a + “” it defaults to concatenation and turns the result into a string.

Tips and tricks for safe sums in Dataiku#

Your first option is to get rid of null values. Maybe you replace them with 0 or maybe you do something else; it’ll depend on your data.

There are many cases, however, where you want to keep null values to do other types of analyses (count null, anyone?). If so, you have a few options:

  • Use the sum() function: sum([A,B,C,D]).

  • Include a nested IF: if(isnull(A),0,A)+if(isnull(B),0,B).

  • Replace if(isnull(A),0,A) with coalesce(A,0).