Concept | Sampling on datasets#


Sampling benefits#

Once you’ve created and/or imported a dataset into Dataiku, you’ll want to explore its contents.

Exploring very large datasets can be difficult, as even simple operations can be expensive in computational resources and time. Dataiku addresses this problem by only displaying a sample when exploring and preparing data.

Note

The same sampling principle applies to visualization in charts, data preparation in a Prepare recipe, and statistical analyses.

The main purpose of sampling is to provide immediate visual feedback while you explore and prepare the dataset, no matter how large it is. Because you are viewing only a relatively small sample of the data, you can quickly:

  • Sort by a column (screenshot of the Sort menu of a dataset column).

  • Filter the data (screenshot of a column filter).

  • Display column distributions (screenshot of the column distribution view).

  • Apply conditional formatting (screenshot of the Color column by value menu).

  • View summary statistics (screenshot of the Analyze window).

Sampling methods#

By default, the sample Dataiku uses for any dataset includes the first 10,000 rows. You can see this information at the top of the Explore tab of the dataset. Next to it, Dataiku also indicates the total number of rows in your dataset.

Although taking the first 10,000 rows is the fastest sampling method, the sample may be biased depending on the composition of the dataset.
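To illustrate how a first-rows sample can be biased, here is a minimal sketch in plain Python. It does not use Dataiku's API. The `orders` dataset, its `year` column, and the helper names `head_sample` and `random_sample` are hypothetical, chosen only to show what happens when a dataset is sorted before sampling.

```python
import random
from itertools import islice


def head_sample(rows, n):
    """Return the first n rows; reading stops once n rows are collected."""
    return list(islice(rows, n))


def random_sample(rows, n, seed=0):
    """Return n rows drawn uniformly at random; this reads every row."""
    all_rows = list(rows)
    return random.Random(seed).sample(all_rows, min(n, len(all_rows)))


# Hypothetical dataset: 100,000 orders sorted by year, 10,000 per year.
orders = [{"order_id": i, "year": 2015 + i // 10_000} for i in range(100_000)]

head = head_sample(iter(orders), 10_000)
rand = random_sample(iter(orders), 10_000)

# Distinct years that appear in each sample.
head_years = sorted({row["year"] for row in head})
rand_years = sorted({row["year"] for row in rand})

print("first rows:", head_years)
print("random:    ", rand_years)
```

In this sketch the first-rows sample contains only the earliest year, because the data is sorted, while the random sample draws from all ten years. Any column distribution or summary statistic computed on the first-rows sample would reflect that single year only.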

Depending on your needs, other sampling methods are available, including random, stratified, and class rebalancing. Clicking the Sample button at the top of the page opens the Sampling settings panel, where you can select a sampling method, increase the number of rows in the sample, or both.

Screenshot of the Sampling panel and sampling information on top of the dataset.
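As a rough illustration of what a stratified sample aims for, the sketch below keeps each group's share in the sample equal to its share in the full data. It is written in plain Python and is an assumption about the general technique, not a description of how Dataiku implements it. The `customers` dataset, the `plan` column, and the `stratified_sample` helper are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(rows, key, n, seed=0):
    """Draw about n rows so that each group keeps its original proportion.

    `rows` must be a list, because the data is read twice.
    Rounding means the result can differ from n by a few rows.
    """
    rng = random.Random(seed)

    # Pass 1: count the rows in each group to compute proportional quotas.
    counts = Counter(row[key] for row in rows)
    total = sum(counts.values())
    quotas = {group: round(n * count / total) for group, count in counts.items()}

    # Pass 2: collect the rows of each group, then draw that group's quota.
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    sample = []
    for group, members in groups.items():
        sample.extend(rng.sample(members, min(quotas[group], len(members))))
    return sample


# Hypothetical dataset: 90% "basic" customers, 10% "premium" customers.
customers = [
    {"customer_id": i, "plan": "premium" if i % 10 == 0 else "basic"}
    for i in range(10_000)
]

strat = stratified_sample(customers, key="plan", n=1_000)
print(Counter(row["plan"] for row in strat))
```

Here the sample holds 900 "basic" rows and 100 "premium" rows, matching the 90/10 split of the full dataset. Note that this sketch reads the data twice: once to count the groups and once to draw from them.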

Caution

The tradeoff for a potentially more representative sample is time: Dataiku needs to make one full pass, or sometimes two, over the data.
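To see why a random sample costs a full pass, consider reservoir sampling, a standard way to draw a uniform random sample from a stream of unknown length. The sketch below is plain Python and is offered only as an illustration of the cost. It is not a claim about the algorithm Dataiku uses, and the `reservoir_sample` helper is hypothetical.

```python
import random


def reservoir_sample(rows, n, seed=0):
    """Draw n rows uniformly at random in one pass (Algorithm R).

    Returns the sample and the number of rows that had to be read.
    """
    rng = random.Random(seed)
    sample = []
    rows_read = 0
    for row in rows:
        rows_read += 1
        if len(sample) < n:
            # Fill the reservoir with the first n rows.
            sample.append(row)
        else:
            # Keep the new row with probability n / rows_read,
            # replacing a randomly chosen row in the reservoir.
            j = rng.randrange(rows_read)
            if j < n:
                sample[j] = row
    return sample, rows_read


# Hypothetical dataset of 200,000 rows, represented by their row numbers.
sample, rows_read = reservoir_sample(range(200_000), 10_000)
print(len(sample), rows_read)  # 10000 200000
```

Every row must be read to give each one an equal chance of selection, so the work grows with the size of the dataset. A first-rows sample, by contrast, can stop after reading 10,000 rows regardless of how many rows follow.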

What’s next?#

In this article, you learned about sampling in Dataiku and how it provides immediate visual feedback while you explore data, no matter how large the dataset. Continue learning about the basics of Dataiku by exploring the Analyze window.

See also

For more information, see:

  • The Sampling article in the reference documentation.

  • The Sampling article in the Developer Guide.