Concept | Sampling on datasets

Sampling benefits

Once you’ve created or imported a dataset in Dataiku, you’ll want to explore the values inside.

Exploring very large datasets can be unwieldy, as even simple operations can be expensive in terms of both computational resources and time. Dataiku addresses this problem by displaying only a sample of the data when you explore and prepare it.

Note

The same sampling principle applies to visualization (Charts), data preparation (Prepare recipe), and statistical analyses (Statistics).

The main purpose of sampling is to provide immediate visual feedback while exploring and preparing the dataset, no matter how large it may be. Because Dataiku works on only a relatively small sample of the data, you can very quickly:

  • Sort the sample by a column.

    Screenshot of the Sort menu of a dataset column.
  • Apply a filter.

    Screenshot of a column filter.
  • Display column distributions.

    Screenshot of the column distribution view.
  • Use conditional formatting.

    Screenshot of the Color column by value menu.
  • View summary statistics.

    Screenshot of the Analyze window.

Sampling methods

By default, the sample Dataiku uses for any dataset includes the first 10,000 rows. You can see this information at the top of the Explore tab of the dataset. Next to it, Dataiku also indicates the total number of rows in your dataset.

Although taking the first 10,000 rows is the fastest sampling method, the sample may be biased depending on the composition of the dataset.
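To make the bias concrete, here is a minimal sketch in plain Python. It does not use or reproduce Dataiku's own sampling code. The `amounts` dataset, its ascending order, and the helper names are hypothetical, chosen only to show how a first-rows sample of sorted data can misrepresent the whole.

```python
import random

def head_sample(rows, n):
    """Return the first n rows, as a first-records sample would."""
    return rows[:n]

def random_sample(rows, n, seed=0):
    """Return n rows drawn uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(rows, n)

def mean(values):
    return sum(values) / len(values)

# Hypothetical dataset: 100,000 order amounts stored in ascending order.
amounts = list(range(100_000))

full_mean = mean(amounts)                        # 49999.5
head_mean = mean(head_sample(amounts, 10_000))   # 4999.5
rand_mean = mean(random_sample(amounts, 10_000))

print(full_mean, head_mean, rand_mean)
```

Because the data is sorted, the first 10,000 rows contain only the smallest values, so their mean is about a tenth of the true mean. The random sample lands much closer to the full-dataset mean.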

Depending on your needs, a number of other sampling methods are available, including random, stratified, and class rebalancing. Clicking the Sample button at the top of the page opens the Sampling settings panel, where you can select a sampling method, increase the number of rows included in the sample, or both.

Screenshot of the Sampling panel and sampling information on top of the dataset.
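As an illustration of what a stratified method aims for, the sketch below draws a sample in which each group keeps roughly its share of the full dataset. It is plain Python written for this article, not Dataiku's implementation. The `tier` column, the 95/5 split, and the `stratified_sample` helper are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(rows, key, n, seed=0):
    """Sample about n rows, keeping each group's share close to its share in rows."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    sample = []
    for members in groups.values():
        # Each group's quota is proportional to its size in the full data.
        quota = round(n * len(members) / len(rows))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample

# Hypothetical dataset: 95% "standard" customers and 5% "premium".
rows = [{"tier": "standard"}] * 9_500 + [{"tier": "premium"}] * 500

strat = stratified_sample(rows, key=lambda r: r["tier"], n=1_000)
print(Counter(r["tier"] for r in strat))
```

With these inputs, the sample holds 950 standard rows and 50 premium rows, the same 95/5 split as the full dataset. A class rebalancing method would instead aim for groups of more equal size.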

Caution

The tradeoff for a potentially more representative sample is the time Dataiku needs to make one full pass, or sometimes two full passes, over the data.
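The sketch below shows why a random sample costs a full pass while a first-records sample does not. It uses reservoir sampling (Algorithm R) as a stand-in technique, which may differ from what Dataiku does internally. The `counting` wrapper and the row counts are hypothetical and exist only to measure how many rows each approach reads.

```python
import random
from itertools import islice

def counting(rows, counter):
    """Yield rows while recording how many were read from the source."""
    for row in rows:
        counter["read"] += 1
        yield row

def first_n(rows, n):
    """First-records sample: stops reading once n rows are collected."""
    return list(islice(rows, n))

def reservoir_sample(rows, n, seed=0):
    """Uniform random sample of n rows in one full pass (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(rows):
        if i < n:
            reservoir.append(row)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < n:
                reservoir[j] = row
    return reservoir

total_rows = 200_000

head_counter = {"read": 0}
head = first_n(counting(range(total_rows), head_counter), 10_000)

full_counter = {"read": 0}
rand = reservoir_sample(counting(range(total_rows), full_counter), 10_000)

print(head_counter["read"])  # 10000
print(full_counter["read"])  # 200000
```

The first-records sample stops after 10,000 rows, whereas the random sample has to read all 200,000 because every row needs a chance to be selected. A method that must know row or group counts before selecting may need one pass to count and a second to pick rows, which is one way two passes can arise.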

What’s next?

In this article, you learned how sampling in Dataiku provides immediate visual feedback while you explore data, no matter how large the dataset. Continue learning about the basics of Dataiku by exploring the Analyze window.

See also

For more information, see:

  • The Sampling article in the reference documentation.

  • The Sampling article in the developer guide.