Reference | Handling decimal notations
When preparing data, you often encounter numeric data in a variety of formats from around the world.
This brief tutorial explains how Dataiku handles the conversion of decimal notations into a universally understood raw format.
Decimal notations
Many parts of the world commonly display large and decimal numbers as 1,234,567.89. However, depending on the country, this same number might be more commonly written as:
1234567.89
1234567,89
1 234 567,89
And many other ways
Since Dataiku needs to assist different systems in talking to each other, and those systems may not have the same opinions, Dataiku only treats “computer-notation” numbers as decimals, out of the box.
Thus, for both the float and double storage types, and for the Decimal meaning, Dataiku will only accept the following kinds of notation:
1234567.89
1.23456789E6
-1234.33
…
Note
You might want to re-read our reference documentation on Schemas, storage types and meanings.
While Dataiku could recognize more forms, other systems, such as Hive, would not, and that would cause various inconsistencies.
Thus, for example, Dataiku will recognize 1,234,567.89 as a string, not a number.
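To make the distinction concrete, here is a minimal Python sketch of a check for "computer notation". It is not Dataiku's actual validation logic: the regular expression and the `is_raw_decimal` helper are assumptions made for illustration, based only on the accepted examples listed above.

```python
import re

# Hypothetical pattern for "computer notation" (an assumption, not Dataiku's
# own grammar): optional sign, digits with an optional fractional part,
# optional exponent, and no thousands separators.
RAW_DECIMAL = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?")


def is_raw_decimal(text: str) -> bool:
    """Return True if the whole string looks like a raw decimal."""
    return RAW_DECIMAL.fullmatch(text.strip()) is not None


samples = [
    "1234567.89",
    "1.23456789E6",
    "-1234.33",
    "1,234,567.89",   # English style with thousands separators
    "1234567,89",     # comma as decimal separator
    "1 234 567,89",   # French style
]

for sample in samples:
    print(f"{sample!r}: {is_raw_decimal(sample)}")
```

Under this sketch, only the first three samples pass. The localized forms are rejected, which mirrors why such values are treated as strings until they are converted.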
Normalizing in a Prepare recipe
You can use a Prepare script (either in a visual analysis or a recipe) to handle datasets with various kinds of numeric representations. In particular, this is a job for the Convert number formats processor.
Here is a snippet of a dataset in a visual analysis containing decimals formatted in both US and French styles.
For the us_notation column, Dataiku predicts a meaning of “Decimal”, but the first two values are invalid. On the other hand, Dataiku predicts a meaning of “Decimal (comma)” for the fr_notation column. Our goal is for Dataiku to recognize both of these columns as valid decimals.
For the fr_notation column, Dataiku suggests a conversion from the French decimal format to a regular decimal. This step uses the Convert number formats processor to convert this column to a Decimal meaning.
The same processor can fix the us_notation column. Add a new step to the script and find the Convert number formats processor. The input format should be recognized as “English” and the output format set to “Raw”.
Now Dataiku recognizes all values of both output columns as having a Decimal meaning, so they can be processed as such by all Dataiku-supported compute engines.
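For readers who want to see the idea behind the conversion outside of the visual interface, the following Python sketch normalizes English and French notations into the raw format. It is only an illustration of the concept and does not reproduce the Convert number formats processor: the `SEPARATORS` table, the format names, and the `to_raw` helper are all hypothetical.

```python
from decimal import Decimal, InvalidOperation

# Hypothetical separator table (an assumption for this sketch, not
# Dataiku configuration).
SEPARATORS = {
    "english": {"group": ",", "decimal": "."},
    "french": {"group": " ", "decimal": ","},
}


def to_raw(text: str, input_format: str) -> str:
    """Convert a localized number string to raw notation, e.g. 1234567.89."""
    seps = SEPARATORS[input_format]
    cleaned = text.strip()
    # French typography often uses no-break spaces as group separators,
    # so map them to a plain space first.
    for space in ("\u00a0", "\u202f"):
        cleaned = cleaned.replace(space, " ")
    # Drop the thousands separator, then switch to a dot decimal separator.
    cleaned = cleaned.replace(seps["group"], "")
    cleaned = cleaned.replace(seps["decimal"], ".")
    # Validate the result so that unparseable input fails loudly.
    try:
        Decimal(cleaned)
    except InvalidOperation:
        raise ValueError(f"cannot parse {text!r} as {input_format} notation")
    return cleaned


print(to_raw("1,234,567.89", "english"))
print(to_raw("1 234 567,89", "french"))
```

The sketch returns a string rather than a float so that no precision is lost during normalization; the caller can then decide which numeric type to parse it into.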