Concept | Sampling on datasets
Once you've created or imported a dataset in Dataiku, you'll want to explore the values inside.
Exploring very large datasets can be unwieldy, as even simple operations can be expensive, both in terms of computational resources and time. The approach Dataiku takes to solving this problem is to display only a sample when exploring and preparing data.
The main purpose of sampling is to provide immediate visual feedback while exploring and preparing the dataset, no matter how large it may be. Because Dataiku works on only a relatively small sample of the data, you can very quickly:
Sort the sample by a column.
Apply a filter.
Display column distributions.
Use conditional formatting.
View summary statistics.
By default, the sample Dataiku uses for any dataset includes the first 10,000 rows. You can see this information at the top of the Explore tab of the dataset. Next to it, Dataiku also indicates the total number of rows in your dataset.
Although taking the first 10,000 rows is the fastest sampling method, the sample may be biased depending on how the dataset is ordered. For example, if the rows are sorted by date, the first 10,000 rows cover only the earliest dates.
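To make the bias concrete, here is a minimal Python sketch that runs outside Dataiku. It assumes a hypothetical dataset of 100,000 rows sorted by a made-up year column, and compares the first 10,000 rows with a random sample of the same size. The names rows, years_seen, head_sample, and random_sample are illustrative only and are not part of any Dataiku API.

```python
import random

# Hypothetical dataset: 100,000 rows sorted by year, with 10,000 rows
# for each year from 2015 to 2024.
rows = [{"year": 2015 + i // 10_000} for i in range(100_000)]


def years_seen(sample):
    """Return the sorted distinct years present in a sample."""
    return sorted({row["year"] for row in sample})


# "First rows" sampling: take the first 10,000 rows as they are stored.
head_sample = rows[:10_000]

# Random sampling: 10,000 rows drawn uniformly without replacement.
rng = random.Random(0)  # fixed seed so the sketch is reproducible
random_sample = rng.sample(rows, 10_000)

print(years_seen(head_sample))         # [2015]
print(len(years_seen(random_sample)))  # number of distinct years in the random sample
```

Because the hypothetical data is sorted, the first-rows sample contains a single year, while a random sample of the same size is spread across the whole dataset.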
Depending on your needs, there are a number of different sampling methods available, such as random, stratified, or class rebalancing, to name a few. Clicking the Sample button at the top of the page opens the Sampling settings panel, where you can select a sampling method and/or increase the number of rows to be included in the sample.
The tradeoff for a potentially more representative sample is the time Dataiku needs to make one full pass, or sometimes two full passes, over the data.
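The cost difference can be illustrated with a short Python sketch. It is not Dataiku's implementation: it uses reservoir sampling (Algorithm R), one common way to draw a fixed-size random sample from a stream, purely to show why a random sample has to read every row while a first-rows sample can stop early. The helper names counted_rows, first_records, and reservoir_sample, and the row counts, are assumptions made for this example.

```python
import itertools
import random


def counted_rows(n, counter):
    """Yield n hypothetical rows, counting how many are actually read."""
    for i in range(n):
        counter["reads"] += 1
        yield i


def first_records(stream, k):
    """Take the first k rows and stop reading once k rows are collected."""
    return list(itertools.islice(stream, k))


def reservoir_sample(stream, k, seed=0):
    """Draw a uniform random sample of k rows in one full pass (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(stream):
        if i < k:
            reservoir.append(row)   # fill the reservoir with the first k rows
        else:
            j = rng.randint(0, i)   # random index in [0, i], both ends included
            if j < k:
                reservoir[j] = row  # new row is kept with probability k / (i + 1)
    return reservoir


TOTAL_ROWS, SAMPLE_SIZE = 200_000, 10_000

head_counter = {"reads": 0}
head = first_records(counted_rows(TOTAL_ROWS, head_counter), SAMPLE_SIZE)

random_counter = {"reads": 0}
rand = reservoir_sample(counted_rows(TOTAL_ROWS, random_counter), SAMPLE_SIZE)

print(head_counter["reads"])    # 10000
print(random_counter["reads"])  # 200000
```

In this sketch the first-rows sample reads only as many rows as it returns, whereas the random sample reads the entire dataset once. Methods that first need a statistic about the whole dataset, such as a total row count or per-class counts, would need an additional pass before sampling.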
In this article, you learned about sampling in Dataiku, and how this allows for immediate visual feedback while exploring data no matter how large the dataset. Continue learning about the basics of Dataiku by exploring the Analyze window.
For more information, see the Sampling article in the reference documentation.