Concept: Datasets in DSS


This content is also included in the free Dataiku Academy course, Basics 101, which is part of the Core Designer learning path. Register for the course if you’d like to track and validate your progress alongside concept videos, summaries, hands-on tutorials, and quizzes.

A dataset in DSS can be any piece of data in a tabular format. Examples of possible DSS datasets include:

  • an uploaded Excel spreadsheet

  • an SQL table

  • a folder of data files on a Hadoop cluster

  • a CSV file in the cloud, such as an Amazon S3 bucket

DSS represents each of these examples in the Flow of a project as a blue square whose icon matches the type of the source dataset.

Image showing example datasets and their representations in the Flow.

Regardless of where the source data comes from, you interact with every DSS dataset in the same way: you read, write, visualize, and manipulate it with the same tools. Every dataset offers the same Explore, Charts, and Statistics tabs, along with the same sets of visual, code, and plugin recipes.

Image showing common methods for interacting with Dataiku datasets.

This is possible because DSS decouples data processing logic (such as recipes in the Flow) from the underlying storage infrastructure of a dataset.

Image depicting the separation of a dataset's storage infrastructure from its processing logic.
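The separation of processing logic from storage can be illustrated with a small, self-contained Python sketch. This is not Dataiku's API: the names RowSource, CsvSource, SqlSource, and count_by are hypothetical, and an in-memory CSV string and an in-memory SQLite table stand in for real connections. The point is only that the processing function is written once and runs unchanged against either storage backend.

```python
import csv
import io
import sqlite3
from typing import Dict, Iterator, Protocol


class RowSource(Protocol):
    """Hypothetical storage interface: anything that can yield rows as dicts."""

    def rows(self) -> Iterator[Dict[str, str]]: ...


class CsvSource:
    """Rows stored as CSV text (stands in for a file-based dataset)."""

    def __init__(self, text: str) -> None:
        self._text = text

    def rows(self) -> Iterator[Dict[str, str]]:
        yield from csv.DictReader(io.StringIO(self._text))


class SqlSource:
    """Rows stored in a SQL table (stands in for a database-backed dataset)."""

    def __init__(self, conn: sqlite3.Connection, table: str) -> None:
        self._conn = conn
        self._table = table

    def rows(self) -> Iterator[Dict[str, str]]:
        cursor = self._conn.execute(f'SELECT * FROM "{self._table}"')
        columns = [col[0] for col in cursor.description]
        for record in cursor:
            # Normalize values to strings so both backends yield the same shape.
            yield {name: str(value) for name, value in zip(columns, record)}


def count_by(source: RowSource, column: str) -> Dict[str, int]:
    """Processing logic: written once, unaware of where the rows live."""
    counts: Dict[str, int] = {}
    for row in source.rows():
        counts[row[column]] = counts.get(row[column], 0) + 1
    return counts


# The same data held in two different storage systems.
csv_source = CsvSource("city,amount\nParis,10\nLyon,5\nParis,7\n")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (city TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("Paris", 10), ("Lyon", 5), ("Paris", 7)],
)
sql_source = SqlSource(conn, "orders")

# Identical processing call, regardless of the backend.
csv_counts = count_by(csv_source, "city")
sql_counts = count_by(sql_source, "city")
```

In this sketch, supporting another storage system means adding one more class with a rows() method; count_by does not change. That mirrors, in miniature, how recipes in the Flow stay the same regardless of the dataset type.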

With the exception of files uploaded directly to DSS (as done in this Basics tutorial), the DSS server does not need to ingest the entire dataset to create its representation in DSS. Generally, creating a dataset in DSS means that the user merely tells DSS how to access the data through a particular connection. DSS records the location of the original external (source) data; the data itself is not copied into DSS. Rather, the dataset in DSS is a view of the data in the original system. Only a sample of the data, as configured by the user, is transferred via the browser.

Image depicting the transfer of a dataset's sample from an external storage into the Dataiku server.
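The idea of a dataset as a pointer to external data, from which only a sample is pulled, can be sketched as follows. This is an illustration under stated assumptions, not how DSS is implemented: DatasetReference and its sample_size field are hypothetical names, a temporary CSV file stands in for the external storage, and taking the first rows is just one possible sampling rule.

```python
import csv
import itertools
import os
import tempfile
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DatasetReference:
    """Hypothetical dataset definition: a location plus sample settings, no data."""

    path: str
    sample_size: int = 3

    def sample(self) -> List[Dict[str, str]]:
        """Parse only the first `sample_size` rows of the source, on demand."""
        with open(self.path, newline="") as handle:
            reader = csv.DictReader(handle)
            return list(itertools.islice(reader, self.sample_size))


# Create a 1,000-row "external" file to point at (illustration only).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    writer = csv.writer(tmp)
    writer.writerow(["id", "value"])
    writer.writerows([[i, i * i] for i in range(1000)])
    source_path = tmp.name

# Defining the dataset stores only the location; no rows are read or copied.
dataset = DatasetReference(path=source_path, sample_size=3)

# Requesting a preview parses three rows and leaves the other 997 untouched.
preview = dataset.sample()

os.remove(source_path)  # clean up the temporary file
```

Creating the DatasetReference is cheap no matter how large the source is, because the object holds only a path and a sample size. The cost is paid when a sample is requested, and it scales with the sample size rather than with the full dataset.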

The following example Flow includes different types of datasets, such as an uploaded file, a table in a SQL database, and cloud storage datasets.

Image showing a Dataiku Flow that contains these different types of datasets.

Learn More

In this lesson, you learned about datasets used in a Dataiku project. Continue learning about the Basics of Dataiku DSS by visiting Concept: Connections.