Partitioned Datasets

When we work with a partitioned dataset, Dataiku DSS displays partition-specific options in the tabs of the dataset and in the Flow. Let’s look at these options in more detail.

Explore Tab

In Dataiku DSS, the Explore tab displays a sample of the dataset. With a partitioned dataset, we can configure the sample to include some or all partitions so that we can perform computations on the selected partitions.

../../_images/view-dataset-sample.png

Sample Size and Partitions

By default, Dataiku DSS creates the sample using all of the partitions. The default sample size is 10,000 rows, the same as for a non-partitioned dataset.

If we use the “first records” sampling method and our first partition contains fewer than 10,000 rows, the sample draws on at least two partitions, and on as many as it takes to reach the sample size of 10,000 rows. However, if our first partition contains 10,000 rows or more, the sample contains records from the first partition only.

../../_images/sample-size-partition-impact.png
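The effect of the “first records” method on a partitioned dataset can be sketched in a few lines of plain Python. This is only an illustration of the behavior described above, not how Dataiku DSS implements sampling; the function name, the partition identifiers, and the row counts are all hypothetical.

```python
from typing import Dict, List, Tuple


def first_records_sample(
    partitions: Dict[str, List[dict]], sample_size: int = 10_000
) -> Tuple[List[dict], List[str]]:
    """Read partitions in order and stop once sample_size rows are collected.

    Illustrative only: this mimics the behavior described in the text,
    it is not the Dataiku DSS implementation.
    """
    sample: List[dict] = []
    partitions_used: List[str] = []
    for partition_id, rows in partitions.items():
        if len(sample) >= sample_size:
            break  # the sample is full, later partitions are never read
        remaining = sample_size - len(sample)
        taken = rows[:remaining]
        if taken:
            sample.extend(taken)
            partitions_used.append(partition_id)
    return sample, partitions_used


def make_rows(partition_id: str, count: int) -> List[dict]:
    """Build hypothetical rows tagged with their partition identifier."""
    return [{"partition": partition_id, "row": i} for i in range(count)]


# Case 1: the first partition has fewer than 10,000 rows,
# so the sample spills over into the next partition.
small_first = {
    "2024-01-01": make_rows("2024-01-01", 4_000),
    "2024-01-02": make_rows("2024-01-02", 7_000),
    "2024-01-03": make_rows("2024-01-03", 5_000),
}
sample, used = first_records_sample(small_first)

# Case 2: the first partition alone fills the sample.
large_first = {
    "2024-01-01": make_rows("2024-01-01", 12_000),
    "2024-01-02": make_rows("2024-01-02", 3_000),
}
sample_large, used_large = first_records_sample(large_first)

print(used, len(sample))
print(used_large, len(sample_large))
```

In the first case the sample mixes rows from two partitions; in the second, every sampled row comes from the first partition, which is worth keeping in mind when interpreting what the Explore tab shows.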

Charts Tab

By default, charts build visualizations over the same sample defined in the Explore tab of the dataset. We can change the sample used for creating the charts, in order to target a specific partition or partitions of the data.

../../_images/charts-tab-select-partitions.png

Statistics Tab

By default, Dataiku DSS uses all available partitions when creating a statistical analysis. When creating an analysis, we can target a specific partition, or partitions, rather than all available partitions. For example, we can select the partition we want to use in each card.

../../_images/stats-tab-select-partitions.png

Status Tab

The Status tab of a dataset lets us compute metrics, or metadata, about our dataset. Once our dataset is partitioned, we have more options for computing them. We can still compute metrics for the whole dataset, but we can also choose to compute metrics at the partition level.

../../_images/status-tab-compute-partitions.png
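To make the difference between the two levels concrete, here is a small sketch in plain Python that computes one metric, a record count, for the whole dataset and for each partition. It does not use any Dataiku DSS API; the rows, the `order_date` partitioning column, and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def record_count_metrics(
    rows: Iterable[dict], partition_column: str
) -> Tuple[int, Dict[str, int]]:
    """Return (count for the whole dataset, count per partition)."""
    per_partition: Dict[str, int] = defaultdict(int)
    for row in rows:
        per_partition[row[partition_column]] += 1
    whole_dataset = sum(per_partition.values())
    return whole_dataset, dict(per_partition)


# Hypothetical rows, partitioned by day on the "order_date" column.
rows = [
    {"order_date": "2024-01-01", "amount": 20.0},
    {"order_date": "2024-01-01", "amount": 35.5},
    {"order_date": "2024-01-02", "amount": 12.0},
    {"order_date": "2024-01-03", "amount": 8.25},
    {"order_date": "2024-01-03", "amount": 41.0},
    {"order_date": "2024-01-03", "amount": 5.0},
]

whole_count, counts_by_partition = record_count_metrics(rows, "order_date")
print(whole_count)
print(counts_by_partition)
```

The whole-dataset figure is a single number, while the partition-level result gives one value per partition, which makes it easier to spot a partition that is unexpectedly small or empty.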

Settings Tab

We can also interact with partitioning from the Partitioning tab of the Settings page. For file-based partitioning, where partitioning is defined by the layout of the files on disk, we can list all partitions and preview how the files are stored on the Dataiku DSS server.

The preview displays the number of partitions and includes a link to the corresponding file for each partition. Whenever we ask Dataiku DSS to perform a computation on a partition, it starts by reading the corresponding file. To see the pattern that maps stored files to partitions, we can look at the Pattern field.

../../_images/settings-tab-partitioning-file-based.png
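The idea of a pattern that maps file paths to partitions can be sketched as follows. This is an assumption-laden illustration, not Dataiku DSS code: it assumes a day-level time dimension written with `%Y`, `%M`, and `%D` placeholders, treats the rest of the pattern as a regular expression, and uses a hypothetical pattern string and hypothetical file paths. The actual pattern for a dataset is whatever appears in its Pattern field.

```python
import re
from collections import defaultdict
from typing import Dict, List, Optional

# Assumed placeholders for a day-level time dimension (illustrative).
TOKEN_REGEX = {
    "%Y": r"(?P<year>\d{4})",
    "%M": r"(?P<month>\d{2})",
    "%D": r"(?P<day>\d{2})",
}


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Turn a pattern such as '/%Y/%M/%D/.*' into a compiled regex.

    Text outside the placeholders is kept as-is and therefore
    interpreted as a regular expression (an assumption of this sketch).
    """
    parts = re.split(r"(%[YMD])", pattern)
    body = "".join(TOKEN_REGEX.get(part, part) for part in parts)
    return re.compile("^" + body + "$")


def partition_for_path(regex: "re.Pattern[str]", path: str) -> Optional[str]:
    """Return a 'YYYY-MM-DD' partition identifier, or None if no match."""
    match = regex.match(path)
    if match is None:
        return None
    return "{year}-{month}-{day}".format(**match.groupdict())


def group_files_by_partition(pattern: str, paths: List[str]) -> Dict[str, List[str]]:
    """Group the paths that match the pattern by partition identifier."""
    regex = pattern_to_regex(pattern)
    grouped: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        partition_id = partition_for_path(regex, path)
        if partition_id is not None:
            grouped[partition_id].append(path)
    return dict(grouped)


# Hypothetical pattern and file layout.
pattern = "/%Y/%M/%D/.*"
paths = [
    "/2024/01/15/orders.csv",
    "/2024/01/15/returns.csv",
    "/2024/01/16/orders.csv",
    "/README.txt",  # does not match the pattern, so it belongs to no partition
]

files_by_partition = group_files_by_partition(pattern, paths)
print(files_by_partition)
```

Files that do not match the pattern are left out, which mirrors the point made above: the layout of the files on disk is what defines the partitions.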

For SQL-based, or column-based, partitioning, the Partitioning tab displays a section dedicated to the configuration of our partitioning dimension.

../../_images/settings-tab-partitioning-column-based.png

If we change this section after our partitioning dimension has been set up and propagated across the Flow, we need to update the dataset dependencies for each dependent dataset in the Flow.

Flow Views

Using Flow Views, we can easily see where we have configured partitions in our Flow.

Partitioning Schemes

When we select Partitioning schemes, Dataiku DSS highlights the parts of the Flow where we have used partition dimensions. This gives us a visual way to validate our partition configuration.

../../_images/partitioning-schemes.png

Partitions Count

When we select Partitions count, Dataiku DSS shows the number of partitions available on each dataset. This gives us a visual way to validate the consistency of our partitions across the entire Flow.
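The kind of consistency check this view supports can also be sketched in plain Python. The example below is hypothetical throughout: the dataset names and partition identifiers are invented, and the partition lists are written by hand rather than read from Dataiku DSS.

```python
from typing import Dict, List, Set, Union

Report = Dict[str, Dict[str, Union[int, List[str]]]]


def partition_report(partitions_by_dataset: Dict[str, Set[str]]) -> Report:
    """For each dataset, report its partition count and which partitions
    seen elsewhere in the Flow it is missing."""
    all_partitions: Set[str] = set().union(*partitions_by_dataset.values())
    return {
        name: {
            "count": len(partitions),
            "missing": sorted(all_partitions - partitions),
        }
        for name, partitions in partitions_by_dataset.items()
    }


# Hypothetical Flow with three datasets partitioned by day.
partitions_by_dataset = {
    "orders": {"2024-01-01", "2024-01-02", "2024-01-03"},
    "orders_prepared": {"2024-01-01", "2024-01-02", "2024-01-03"},
    "orders_by_customer": {"2024-01-01", "2024-01-02"},
}

report = partition_report(partitions_by_dataset)
for name, info in report.items():
    print(name, info["count"], info["missing"])
```

A dataset whose count is lower than that of its upstream datasets, like `orders_by_customer` here, usually means a partition has not been built yet.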

What’s next?

In this summary, we looked at some of the available options when interacting with a partitioned dataset. You can try out these options in the hands-on lessons in the Advanced Partitioning course.