Concept: Storage Type
This content is also included in the free Dataiku Academy course, Basics 101, which is part of the Core Designer learning path. Register for the course there if you’d like to track and validate your progress alongside concept videos, summaries, hands-on tutorials, and quizzes.
The schema of a Dataiku dataset is the list of its columns, with their names and types. Each column has two kinds of “types” in Dataiku: a storage type and a meaning.
The storage type indicates how the dataset backend should store the column data, and how many bytes will be allocated to store these values. Common storage types are string, integer, float, boolean, and date. For a CSV file, all columns are stored as string because a CSV file contains only text.
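To see why a text format carries no type information of its own, here is a minimal sketch in plain Python, outside Dataiku. The sample data and column names are hypothetical, and the standard-library csv module stands in for any CSV reader.

```python
import csv
import io

# Hypothetical sample standing in for a CSV file on disk.
raw = "id,price,active\n1,9.99,true\n2,14.50,false\n"

rows = list(csv.DictReader(io.StringIO(raw)))

# Every parsed value is a Python str, whatever it looks like.
value_types = {type(value).__name__ for row in rows for value in row.values()}
print(value_types)

# Turning text into a narrower type is an explicit, separate step.
ids = [int(row["id"]) for row in rows]
prices = [float(row["price"]) for row in rows]
print(ids, prices)
```

The file itself only holds characters, so any integer or float interpretation has to be applied on top of the stored text.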
Meanwhile, the meaning gives a “rich” semantic label to the data. Meanings are automatically detected from the contents of the columns, but you can also define custom meanings. Meanings have high-level definitions such as URL, IP address, or country. Each meaning can validate a cell value, so each cell is either valid or invalid for a given meaning.
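The idea of a meaning validating individual cells can be sketched as follows. This is an illustration only, not how Dataiku implements meanings: the function is_valid_ip and the sample column are hypothetical, and the check relies on Python's standard ipaddress module.

```python
import ipaddress

def is_valid_ip(cell: str) -> bool:
    """Return True if the cell parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(cell.strip())
        return True
    except ValueError:
        return False

# Hypothetical column: every cell is stored as a string,
# but only some cells are valid for an "IP address" meaning.
column = ["192.168.0.1", "10.0.0.256", "::1", "not an ip"]
validity = [is_valid_ip(cell) for cell in column]

for cell, ok in zip(column, validity):
    print(f"{cell!r}: {'valid' if ok else 'invalid'}")
```

Note that the storage type of the column stays string throughout. Validity is judged per cell against the meaning, independently of how the values are stored.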
Storage types and meanings are related. Both constrain the values that the column can contain and are useful in managing data in different ways. You can find the storage type and meaning of each column in the Dataset view, when importing a dataset, and in the Explore tab for any dataset in your project.
When you import a dataset from a connection (like a SQL table), its columns already have defined “types” that come from the source, and these should not be changed.
The storage type of a column impacts its ability to serve as a “key” column when joining two datasets. For example, a string column in one dataset cannot serve as the key column with an integer column in another dataset.
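The reason mismatched key types fail can be illustrated in plain Python, where the string "42" and the integer 42 are different values. This is a sketch under stated assumptions rather than Dataiku's join implementation: the datasets, the column names, and the inner_join helper are all hypothetical.

```python
# Hypothetical datasets: orders store customer_id as a string,
# customers store it as an integer.
orders = [
    {"order_id": "A1", "customer_id": "42"},
    {"order_id": "A2", "customer_id": "7"},
]
customers = [
    {"customer_id": 42, "name": "Ada"},
    {"customer_id": 7, "name": "Lin"},
]

def inner_join(left, right, key):
    """Join two lists of row dicts on an exact match of the key column."""
    index = {row[key]: row for row in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]

# "42" != 42, so no row finds a partner.
mismatched = inner_join(orders, customers, "customer_id")

# Casting the key to a common type first makes the rows line up.
orders_cast = [{**o, "customer_id": int(o["customer_id"])} for o in orders]
matched = inner_join(orders_cast, customers, "customer_id")

print(len(mismatched), len(matched))
```

In practice, this means aligning the storage types of the two key columns before joining.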