Concept | Safe sums across columns in Dataiku formulas#
People will usually tell you that 1+1=2
. In general, you can believe them; however, in some cases in Dataiku, the following may also be true from time to time : 1+1=11
. What’s happening here? When are numbers concatenated rather than summed?
When do 1+1=11 and 1+1=2 both happen?#
Let’s use this example:
A |
B |
C |
D |
= A + B + C + D |
---|---|---|---|---|
1 |
2 |
3 |
4 |
10 |
null |
2 |
3 |
4 |
234 |
1 |
null |
3 |
4 |
134 |
1 |
2 |
null |
4 |
34 |
1 |
2 |
3 |
null |
6 |
“But all my data is stored as integers! What is the logic behind this add/concatenate behavior?”
Logic:
1+2+3+4=10 ← add 1+2 , add 3, add 4
null+2+3+4=234 ← (null+2), then concatenate 3, concatenate 4
1+null+3+4=134 ← (1+null), then concatenate 3, concatenate 4
1+2+null+4=34 ← add 1+2, then (+ null), then concatenate 4
1+2+3+null=6 ← add 1+2 , add 3, (+null) then…
See the pattern of how addition turns into concatenation? It always happens after a null value. Why? The short answer is, “It’s complicated.”
Switching from Add → Concatenate#
Here’s the long answer:
Auto-detection for operations occurs by value and not by column.
The detection of the column datatype occurs after all the operations have been completed.
Concatenate always follows null +
anything. It doesn’t matter that all of the data in your columns are numbers, or that they’re stored as a number data type, or that logically 1+nothing+3+4 should be 8, not 134.
What happened in the case of the lovely A+B+C+D above is that null values are turned into “”
. Whenever the processor sees anything with a + “”
it defaults to concatenation and turns the result into a string.
Tips and tricks for safe sums in Dataiku#
Your first option is to get rid of null values. Maybe you replace them with 0 or maybe you do something else; it’ll depend on your data.
There are many cases, however, where you want to keep null values to do other types of analyses (count null, anyone?). If so, you have a few options:
Use the
sum()
function:sum([A,B,C,D])
.Include a nested IF:
if(isnull(A),0,A)+if(isnull(B),0,B)
.Replace
if(isnull(A),0,A)
withcoalesce(A,0)
.