Concept | Sampling on datasets


This content is also included in the free Dataiku Academy course, Basics 101, which is part of the Core Designer learning path. Register for the course there if you’d like to track and validate your progress alongside concept videos, summaries, hands-on tutorials, and quizzes.

Sampling provides immediate visual feedback while you explore data, no matter how large the dataset. The default sample is the first 10,000 rows, but a number of other sampling methods are available. The same sampling principle applies to visualization (Charts) and data preparation (Prepare recipe).

A Dataiku visualization of data sampling.

Exploring very large datasets can be unwieldy, because even simple operations can be expensive in both computational resources and time. Dataiku DSS solves this problem by displaying only a sample of the data when you explore and prepare it.

The default sample for any dataset is the first 10,000 rows. Although this is the fastest method, the sample may be biased depending on the composition of the dataset: if the rows are stored in a particular order (by date, for instance), the first rows may not represent the rest. Depending on your needs, many other sampling strategies are available, such as random, stratified, or class rebalancing. The tradeoff for a potentially more representative sample is the time DSS needs to make one full pass, or sometimes two full passes, over the data.
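To make the tradeoff concrete, here is a minimal Python sketch of three strategies: first rows, a random sample taken in one full pass, and an approximate class rebalancing that needs two full passes. It is not how DSS implements sampling. The function names, the `make_rows` generator, and the toy dataset (ordered by label) are all assumptions made for illustration.

```python
import random
from collections import Counter
from itertools import islice


def head_sample(rows, n):
    """Keep the first n rows; reading stops after n rows."""
    return list(islice(rows, n))


def reservoir_sample(rows, n, seed=0):
    """Uniform random sample of n rows in one full pass (reservoir sampling)."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < n:
            sample.append(row)
        else:
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < n:
                sample[j] = row
    return sample


def rebalanced_sample(make_rows, key, per_class, seed=0):
    """Approximate class rebalancing in two full passes.

    Pass 1 counts rows per class. Pass 2 keeps each row with probability
    per_class / count[class], so each class yields about per_class rows.
    """
    counts = Counter(key(row) for row in make_rows())  # pass 1
    rng = random.Random(seed)
    return [row for row in make_rows()                 # pass 2
            if rng.random() < per_class / counts[key(row)]]


def make_rows():
    # Hypothetical dataset ordered by label: 90,000 "a" rows, then 10,000 "b".
    for i in range(100_000):
        yield {"id": i, "label": "a" if i < 90_000 else "b"}


def label_counts(sample):
    return Counter(row["label"] for row in sample)


head = label_counts(head_sample(make_rows(), 10_000))
rand = label_counts(reservoir_sample(make_rows(), 10_000))
bal = label_counts(rebalanced_sample(make_rows, lambda r: r["label"], 5_000))

print("first rows:", dict(head))  # only label "a" appears
print("random    :", dict(rand))  # roughly 90% "a", 10% "b"
print("rebalanced:", dict(bal))   # roughly equal counts per label
```

On this ordered toy dataset, the first-rows sample never sees label "b", while the two alternatives do, at the cost of reading every row once or twice.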

The main purpose of sampling is to provide immediate visual feedback while you explore and prepare a dataset, no matter how large it may be. Because DSS works on only a relatively small sample of the data, you can very quickly sort the sample by a column, apply a filter, display column distributions, color columns by values, and view summary statistics.
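As a rough illustration of why these operations are fast on a sample, the sketch below runs a sort, a filter, a value distribution, and summary statistics on 10,000 in-memory rows using only the Python standard library. The column names (`country`, `amount`) and the generated values are hypothetical, and this is not DSS code.

```python
import random
import statistics
from collections import Counter

# Hypothetical 10,000-row sample held in memory.
rng = random.Random(42)
sample = [
    {"country": rng.choice(["FR", "US", "JP"]), "amount": rng.randint(1, 500)}
    for _ in range(10_000)
]

# Sort the sample by a column (largest amounts first).
by_amount = sorted(sample, key=lambda row: row["amount"], reverse=True)

# Apply a filter.
large_orders = [row for row in sample if row["amount"] > 400]

# Column distribution and summary statistics.
country_counts = Counter(row["country"] for row in sample)
amounts = [row["amount"] for row in sample]
summary = {
    "min": min(amounts),
    "max": max(amounts),
    "mean": statistics.mean(amounts),
    "median": statistics.median(amounts),
}

print(by_amount[0], len(large_orders))
print(country_counts.most_common())
print(summary)
```

Every figure computed this way describes the sample, not the full dataset.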

Learn More

In this lesson, you learned about sampling in Dataiku and how it provides immediate visual feedback while you explore data, no matter how large the dataset. Continue learning about the basics of Dataiku by exploring the Analyze window.