Handling Decimal Notations
When preparing data, you often encounter numeric data in a variety of formats from around the world.
This brief tutorial introduces how Dataiku DSS handles conversion of decimal notations into a universally-understood raw format.
Many parts of the world commonly display large and decimal numbers as 1,234,567.89. However, depending on the country, this same number might more commonly be written as 1 234 567,89, or in many other ways.
Since Dataiku DSS needs to help different systems talk to each other, and those systems may not agree on a notation, out of the box DSS only treats “computer-notation” numbers as decimals.
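Thus, both for the double storage types and for the Decimal meaning, DSS will only accept plain computer notation: digits with a period as the decimal separator and no thousands separator (for example, 1234567.89).
For background, you might want to re-read the DSS documentation about storage types and meanings.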
While DSS could recognize more forms, other systems, such as Hive, would not, and that would cause various inconsistencies.
Thus, for example, 1,234,567.89 will be recognized as a string by DSS, not a number.
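To make the distinction concrete, here is a minimal Python sketch of a "computer-notation" check. It is not DSS code: the function name `is_raw_decimal` and the regular expression are assumptions chosen for illustration, and the exact rules DSS applies may differ (for instance, in how it treats scientific notation).

```python
import re

# Assumed definition of "computer notation": optional sign, digits,
# a period as the decimal separator, no thousands separator,
# and an optional exponent. This pattern is illustrative only.
RAW_DECIMAL = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?")


def is_raw_decimal(text):
    """Return True if the whole string is a plain computer-notation number."""
    return RAW_DECIMAL.fullmatch(text.strip()) is not None


for value in ["1234567.89", "1,234,567.89", "1 234 567,89"]:
    kind = "number" if is_raw_decimal(value) else "string"
    print(f"{value!r} would be treated as a {kind}")
```

Under this assumed definition, only the first value passes. The two locale-formatted values fall back to being treated as text, which mirrors the behavior described above.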
Normalizing in a Prepare recipe
You can use a Prepare script (either in a visual analysis or a recipe) to handle datasets with various kinds of numeric representations. In particular, this is a job for the Convert number formats processor.
Here is a snippet of a dataset in a visual analysis containing decimals formatted in both US and French styles.
For the us_notation column, DSS predicts a meaning of “Decimal”, but the first two values are invalid. On the other hand, DSS predicts a meaning of “Decimal (comma)” for the fr_notation column. Our goal is for DSS to recognize both of these columns as valid decimals.
For the fr_notation column, DSS suggests a conversion from the French decimal format to a regular decimal. This step uses the Convert number formats processor to convert this column to a Decimal meaning.
The same processor can fix the us_notation column. Add a new step to the script and find the Convert number formats processor. The input format should be recognized as “English” and the output format set to “Raw”.
Now DSS recognizes all values of both output columns as having a Decimal meaning, and they can be processed as such by all DSS-supported compute engines.
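Conceptually, converting from an "English" or "French" input format to a "Raw" output amounts to dropping the thousands separators and using a period as the decimal separator. The Python sketch below illustrates that idea outside of DSS. The `SEPARATORS` table and the `to_raw` function are hypothetical names introduced here, not the processor's actual implementation or API, and the separator lists are assumptions.

```python
# Hypothetical separator table: input format -> (thousands separators, decimal separator).
# The French entry assumes a regular, non-breaking, or narrow non-breaking space.
SEPARATORS = {
    "english": ((",",), "."),
    "french": ((" ", "\u00a0", "\u202f"), ","),
}


def to_raw(text, input_format):
    """Convert a locale-formatted number string to raw computer notation."""
    thousands, decimal = SEPARATORS[input_format]
    cleaned = text.strip()
    # Remove every thousands separator used by the input format.
    for sep in thousands:
        cleaned = cleaned.replace(sep, "")
    # Use a period as the decimal separator.
    if decimal != ".":
        cleaned = cleaned.replace(decimal, ".")
    # Sanity check: raises ValueError if the result is still not a number.
    float(cleaned)
    return cleaned


print(to_raw("1,234,567.89", "english"))  # 1234567.89
print(to_raw("1 234 567,89", "french"))   # 1234567.89
```

Returning a string rather than a float keeps the digits exactly as written, which avoids introducing floating-point rounding during the normalization step itself.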