Concept | Dataiku datasets#

A dataset in Dataiku can be any piece of data in a tabular format. Examples of possible Dataiku datasets include:

  • An uploaded Excel spreadsheet

  • An SQL table

  • A folder of data files on a Hadoop cluster

  • A CSV file in the cloud, such as an Amazon S3 bucket

See also

As a developer, for more information on datasets, see the following articles in the developer guide:

Dataset representation in the Flow#

In a project's Flow, Dataiku represents every dataset as a blue square whose icon matches the type of the source dataset.

Image showing example datasets and their representations in the Flow.

The following example Flow includes different types of datasets, such as an uploaded file, a table in a SQL database, and cloud storage datasets.

A Dataiku screenshot of a Flow containing different types of datasets, such as an uploaded file, a table in a SQL database, and cloud storage datasets.

Interactions with datasets#

Regardless of where the source data originates, you interact with every Dataiku dataset in the same way: you read, write, visualize, and manipulate it using the same methods.

Indeed, every dataset offers the same interface, which includes:

  • An Explore tab for investigating the dataset

  • A Charts tab for visualization

  • A Statistics tab for in-depth statistical reports

  • A Data Quality tab for establishing rules

  • A Metrics tab for tracking important measurements

  • A History tab for following the dataset history (creation date, commits, etc.)

  • A Settings tab including details about the source of the dataset, either the underlying connection or the original files that were uploaded

  • The same sets of visual, code, and plugin recipes

Screenshot of a dataset interface.

This is possible because Dataiku decouples data processing logic (such as recipes in the Flow) from the underlying storage infrastructure of a dataset.

Image depicting the separation of a dataset's storage infrastructure from its processing logic.
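
To make the decoupling concrete, here is a minimal sketch in plain Python. It is not Dataiku's API: `TabularSource`, `CsvSource`, `SqliteSource`, and `total_amount` are hypothetical names invented for illustration, and the column names and values are made up. The point is only that the processing function never needs to know which storage backend supplies the rows.

```python
import csv
import io
import sqlite3
from typing import Dict, Iterator, Protocol


class TabularSource(Protocol):
    """Hypothetical storage interface: anything that can yield rows as dicts."""

    def rows(self) -> Iterator[Dict[str, str]]: ...


class CsvSource:
    """Rows stored as CSV text (a stand-in for an uploaded file)."""

    def __init__(self, text: str) -> None:
        self._text = text

    def rows(self) -> Iterator[Dict[str, str]]:
        yield from csv.DictReader(io.StringIO(self._text))


class SqliteSource:
    """Rows stored in a SQL table (a stand-in for a database connection)."""

    def __init__(self, conn: sqlite3.Connection, table: str) -> None:
        self._conn = conn
        self._table = table

    def rows(self) -> Iterator[Dict[str, str]]:
        cursor = self._conn.execute(f'SELECT * FROM "{self._table}"')
        columns = [col[0] for col in cursor.description]
        for record in cursor:
            # Normalize values to strings so both backends yield the same shape.
            yield {name: str(value) for name, value in zip(columns, record)}


def total_amount(source: TabularSource) -> float:
    """Processing logic: knows nothing about where the rows are stored."""
    return sum(float(row["amount"]) for row in source.rows())


# Same illustrative data, held in two different storage backends.
csv_source = CsvSource("customer,amount\nalice,10.5\nbob,4.5\n")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [("alice", 10.5), ("bob", 4.5)])
sql_source = SqliteSource(conn, "orders")

csv_total = total_amount(csv_source)
sql_total = total_amount(sql_source)
print(csv_total, sql_total)
```

Both calls go through the same `total_amount` function and return the same result, which mirrors how the same recipes apply to a dataset whatever its underlying storage.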

Connections to the data#

Except when you upload files directly to Dataiku, the Dataiku server does not need to ingest the entire dataset to create its representation.

Generally, creating a dataset means telling Dataiku how to access the data through a particular connection.

Dataiku remembers the location of the original external or source datasets. The data is not copied into Dataiku. Rather, the dataset in Dataiku is a view of the data in the original system. Only a sample of the data, as configured by the user, is transferred via the browser.

Image depicting the transfer of a dataset's sample from an external storage into the Dataiku server.
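
The idea of a dataset as a pointer plus a sample can be sketched in plain Python as follows. This is an illustration under stated assumptions, not Dataiku's implementation: `DatasetPointer` and its `sample_size` field are hypothetical, a temporary CSV file stands in for the external system, and a simple first-rows sample is assumed.

```python
import csv
import itertools
import os
import tempfile
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DatasetPointer:
    """Hypothetical dataset definition: where the data lives, not the data itself."""

    path: str             # location of the source file in the external system
    sample_size: int = 3  # how many rows to fetch for exploration

    def sample(self) -> List[Dict[str, str]]:
        """Parse only the first `sample_size` rows; remaining rows are never parsed."""
        with open(self.path, newline="") as handle:
            reader = csv.DictReader(handle)
            return list(itertools.islice(reader, self.sample_size))


# Create a 1,000-row source file to stand in for external storage.
with tempfile.NamedTemporaryFile("w", suffix=".csv", newline="", delete=False) as tmp:
    writer = csv.writer(tmp)
    writer.writerow(["id", "value"])
    writer.writerows((i, i * i) for i in range(1000))
    source_path = tmp.name

# "Creating the dataset" only records the location; no rows are read yet.
dataset = DatasetPointer(path=source_path, sample_size=3)

# Rows are fetched only when a sample is requested.
preview = dataset.sample()
print(preview)

os.remove(source_path)
```

The object itself holds only a path and a sample size, so defining it costs nothing regardless of how large the source file is.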

See also

For more information, see Connecting to data in the reference documentation.

What’s next?#

In this lesson, you learned about datasets used in a Dataiku project. Continue getting to know the basics of Dataiku by learning about: