Concept | Safe sums across columns in Dataiku formulas#
People will usually tell you that 1+1=2
. In general, you can believe them. However, in some cases in Dataiku, the following may also be true from time to time : 1+1=11
. What’s happening here? When are numbers concatenated rather than summed?
When 1+1=11 and 1+1=2 both happen#
Let’s use this example:
A |
B |
C |
D |
= A + B + C + D |
---|---|---|---|---|
1 |
2 |
3 |
4 |
10 |
null |
2 |
3 |
4 |
234 |
1 |
null |
3 |
4 |
134 |
1 |
2 |
null |
4 |
34 |
1 |
2 |
3 |
null |
6 |
“But all my data is stored as integers! What’s the logic behind this add/concatenate behavior?”
Logic:
1+2+3+4=10 ← add 1 + 2, add 3, add 4
null+2+3+4=234 ← (null + 2), then concatenate 3, concatenate 4
1+null+3+4=134 ← (1 + null), then concatenate 3, concatenate 4
1+2+null+4=34 ← add 1 + 2, then (+ null), then concatenate 4
1+2+3+null=6 ← add 1 + 2, add 3, (+ null), then?
See the pattern of how addition turns into concatenation? It always happens after a null value. Why? The short answer is, “It’s complicated.”
Switching from Add → Concatenate#
Here’s the long answer:
Auto-detection for operations occurs by value and not by column.
The detection of the column datatype occurs after all the operations have been completed.
Concatenate always follows null +
anything. It doesn’t matter that the data in your columns are numbers, or that they’re stored as a number data type, or that logically 1+nothing+3+4 should be 8, not 134.
What happened in the case of the lovely A+B+C+D above is that null values are turned into “”
. Whenever the processor sees anything with a + “”
it defaults to concatenation and turns the result into a string.
Tips and tricks for safe sums in Dataiku#
Your first option is to get rid of null values. Maybe you replace them with 0 or maybe you do something else. It’ll depend on your data.
There are many cases, however, where you want to keep null values to do other types of analyses (count null, anyone?). If so, you have a few options:
Use the
sum()
function:sum([A,B,C,D])
.Include a nested IF:
if(isnull(A),0,A)+if(isnull(B),0,B)
.Replace
if(isnull(A),0,A)
withcoalesce(A,0)
.