Handling Decimal Notations

When preparing data, you often encounter numeric data in a variety of formats from around the world.

This brief tutorial explains how Dataiku converts decimal notations into a universally understood raw format.

Decimal notations

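Many parts of the world commonly display large and decimal numbers as 1,234,567.89. However, depending on the country, this same number might be more commonly written in another way, such as 1 234 567,89 (space as thousands separator, comma as decimal mark, as in France) or 1.234.567,89 (period as thousands separator, comma as decimal mark, as in Germany).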

Because Dataiku must help different systems exchange data, and those systems do not all agree on how numbers are written, Dataiku treats only “computer-notation” numbers as decimals out of the box.

Thus, for both the float and double storage types, and for the Decimal meaning, Dataiku will only accept the following kinds of notation:

  • 1234567.89

  • 1.23456789E6

  • -1234.33

Note

You might want to re-read our documentation about storage types and meanings.

While Dataiku could recognize more forms, other systems, such as Hive, would not, and that would cause various inconsistencies.

Thus, for example, Dataiku will recognize 1,234,567.89 as a string, not a number.
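To make the distinction concrete, here is a minimal Python sketch of a "computer-notation" check. It is not Dataiku's actual parser: the function name is_computer_notation and the regular expression are assumptions chosen only to match the three accepted examples above and to reject notations that use thousands separators or a decimal comma.

```python
import re

# Assumed pattern for "computer notation": optional sign, digits with an
# optional "." fraction, and an optional exponent. This is an illustration,
# not the rule Dataiku itself applies.
COMPUTER_NOTATION = re.compile(r"[+-]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?")


def is_computer_notation(text: str) -> bool:
    """Return True if the whole string matches the assumed raw format."""
    return COMPUTER_NOTATION.fullmatch(text) is not None


samples = [
    "1234567.89",
    "1.23456789E6",
    "-1234.33",
    "1,234,567.89",   # US display style: thousands separators
    "1 234 567,89",   # French display style: decimal comma
]

for sample in samples:
    kind = "number" if is_computer_notation(sample) else "string"
    print(f"{sample!r} would be treated as a {kind}")
```

Under this sketch, the first three samples are classified as numbers and the last two as strings, which mirrors the behavior described above.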

Normalizing in a Prepare recipe

You can use a Prepare script (either in a visual analysis or a recipe) to handle datasets with various kinds of numeric representations. In particular, this is a job for the Convert number formats processor.

Here is a snippet of a dataset in a visual analysis containing decimals formatted in both US and French styles.

"A dataset with decimal columns in two US and French formats"

For the us_notation column, Dataiku predicts a meaning of “Decimal”, but the first two values are invalid. On the other hand, Dataiku predicts a meaning of “Decimal (comma)” for the fr_notation column. Our goal is for Dataiku to recognize both of these columns as valid decimals.

For the fr_notation column, Dataiku suggests a conversion from the French decimal format to a regular decimal. This step uses the Convert number formats processor to convert this column to a Decimal meaning.

"Context menu to convert French format to regular decimal format"

The same processor can fix the us_notation column. Add a new step to the script and find the Convert number formats processor. The input format should be recognized as “English” and the output format set to “Raw”.

Prepare recipe output with converted number formats.

Dataiku now recognizes all values in both output columns as having a Decimal meaning, so they can be processed as decimals by all Dataiku-supported compute engines.
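For readers who want to see the idea behind the conversion outside the visual interface, the following Python sketch normalizes "English" and "French" notations to the raw format. It does not reproduce the Convert number formats processor: the function to_raw, the SEPARATORS table, and the style names are hypothetical, and the sketch assumes each value is already known to follow the stated style.

```python
from decimal import Decimal

# Hypothetical separator table; the keys are illustrative labels only.
SEPARATORS = {
    "english": {"group": ",", "decimal": "."},
    "french": {"group": " ", "decimal": ","},
}


def to_raw(text: str, style: str) -> str:
    """Convert a locale-formatted number string to raw notation.

    Raises decimal.InvalidOperation if the cleaned text is not a number.
    """
    seps = SEPARATORS[style]
    cleaned = text.strip()
    # Thousands separators are often stored as non-breaking spaces,
    # so treat them as ordinary spaces before removing group separators.
    for space in ("\u00a0", "\u202f"):
        cleaned = cleaned.replace(space, " ")
    cleaned = cleaned.replace(seps["group"], "")
    cleaned = cleaned.replace(seps["decimal"], ".")
    # Decimal validates the result without introducing float rounding.
    return str(Decimal(cleaned))


print(to_raw("1,234,567.89", "english"))
print(to_raw("1 234 567,89", "french"))
```

Both calls print 1234567.89. Using Decimal rather than float is a design choice in this sketch: it validates the cleaned string while leaving its digits unchanged.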