Concept | Split recipe

The Split recipe divides a dataset into two or more parts based on a condition. There are four ways to define that condition.

  • Splitting based on the values of a single column.

  • Randomly dispatching data.

  • Defining filters on one or more columns.

  • Dispatching based on percentiles of ordered data.

Splitting based on the values of a single column

The first method is splitting a dataset based on either discrete values or ranges of values of a single column. In this example, we’ve used the Split recipe to split the transactions into three output datasets based on ranges of values in the computed column, datediff, while ensuring that each transaction appears in one and only one output.

After the split, the first dataset contains all customers whose first order was in the last 30 days. The second contains all customers whose first order was in the last 60 days and who are not in the first group. The third contains all customers whose first order was in the last 90 days and who are not in either of the other two groups. We’ve dropped the remaining rows.

../../_images/split-map-value-ranges.png
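
To make the range-based logic concrete, here is a minimal sketch in plain Python. It is not how Dataiku implements the recipe; it only illustrates the idea of mapping value ranges of one column to outputs. The function name `split_by_ranges`, the sample rows, the output names, and the exact range boundaries (0 to 30, 31 to 60, 61 to 90) are all assumptions made for illustration.

```python
def split_by_ranges(rows, column, ranges):
    """Send each row to the first output whose [low, high] range contains row[column].

    Rows that match no range are dropped, mirroring the example above.
    """
    outputs = {name: [] for name, _, _ in ranges}
    for row in rows:
        for name, low, high in ranges:
            if low <= row[column] <= high:
                outputs[name].append(row)
                break  # first match wins, so a row lands in at most one output
    return outputs


# Hypothetical sample data: datediff = days since the customer's first order.
transactions = [
    {"customer_id": "c1", "datediff": 12},
    {"customer_id": "c2", "datediff": 45},
    {"customer_id": "c3", "datediff": 75},
    {"customer_id": "c4", "datediff": 200},  # outside every range, so dropped
]

# Assumed, non-overlapping boundaries (inclusive on both ends).
ranges = [
    ("last_30_days", 0, 30),
    ("last_60_days", 31, 60),
    ("last_90_days", 61, 90),
]

outputs = split_by_ranges(transactions, "datediff", ranges)
for name, matched in outputs.items():
    print(name, [row["customer_id"] for row in matched])
```

Because the ranges do not overlap and the loop stops at the first match, each transaction appears in at most one output.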

Randomly dispatching data

The second method is randomly splitting a dataset. For example, we can perform a three-way split of the transactions in the dataset according to ratios we specify. In this example, we’ve split the dataset into uneven proportions, such that the first output dataset contains 50% of the original transactions, the second contains 17%, and the third contains 33%.

../../_images/split-dispatch-randomly.png
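
The idea of a ratio-based random split can be sketched in plain Python as follows. This is an illustration only, not Dataiku's implementation. The function name `dispatch_randomly`, the output names, and the generated sample rows are assumptions. The sketch assigns each row independently, so the output sizes only approximate the requested ratios; that is a property of this sketch, not a statement about the recipe.

```python
import random


def dispatch_randomly(rows, ratios, seed=None):
    """Assign each row to one output, chosen at random with the given weights."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    names = list(ratios)
    weights = [ratios[name] for name in names]
    outputs = {name: [] for name in names}
    for row in rows:
        target = rng.choices(names, weights=weights, k=1)[0]
        outputs[target].append(row)
    return outputs


# Hypothetical sample data.
transactions = [{"transaction_id": i} for i in range(10_000)]

# Ratios taken from the example above: 50%, 17%, and 33%.
outputs = dispatch_randomly(
    transactions,
    {"output_1": 50, "output_2": 17, "output_3": 33},
    seed=42,
)

for name, matched in outputs.items():
    print(name, f"{len(matched) / len(transactions):.1%}")
```

Every row lands in exactly one output, and the observed shares get closer to the requested ratios as the dataset grows.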

Defining filters on one or more columns

The third method is splitting a dataset using filters defined on one or more columns. In this example, we’ve split the transactions in the dataset by defining a filter on the quantity column. Specifically, the defined filter ensures all transactions with a quantity smaller than ten are placed in one dataset, while all other transactions are placed in another dataset.

../../_images/split-filters.png
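
As a rough illustration of filter-based splitting, the sketch below routes each row to the first output whose filter it satisfies and sends everything else to a fallback output. It is plain Python rather than anything Dataiku-specific. The function name `split_by_filters`, the output names, and the sample rows are hypothetical.

```python
def split_by_filters(rows, filters, default_output=None):
    """Route each row to the first output whose filter returns True.

    Rows matching no filter go to default_output, or are dropped if it is None.
    """
    outputs = {name: [] for name, _ in filters}
    if default_output is not None:
        outputs.setdefault(default_output, [])
    for row in rows:
        for name, predicate in filters:
            if predicate(row):
                outputs[name].append(row)
                break
        else:  # no filter matched this row
            if default_output is not None:
                outputs[default_output].append(row)
    return outputs


# Hypothetical sample data.
transactions = [
    {"transaction_id": 1, "quantity": 3},
    {"transaction_id": 2, "quantity": 10},
    {"transaction_id": 3, "quantity": 25},
    {"transaction_id": 4, "quantity": 9},
]

# One filter on the quantity column; all other rows go to the second output.
filters = [("small_orders", lambda row: row["quantity"] < 10)]
outputs = split_by_filters(transactions, filters, default_output="other_orders")

for name, matched in outputs.items():
    print(name, [row["transaction_id"] for row in matched])
```

A filter here is any function of the whole row, so the same pattern extends to conditions that combine several columns.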

Dispatching based on percentiles of ordered data

The fourth method is splitting a dataset based on percentiles of a sorted column. Because the dataset must be ordered first, the Split recipe in Dataiku prompts you to choose which column to sort by when you select this method. In this example, we’ve ordered the transactions by the date column and split them on percentile ranges into two datasets. One dataset contains the oldest 30% of transactions, while the other contains the remaining transactions.

../../_images/split-percentiles.png
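
A percentile split on ordered data can be sketched in plain Python as a sort followed by a cut at the chosen position. This is only an illustration of the idea, not Dataiku's implementation. The function name `split_by_percentile`, the sample rows, and the choice to round the cutoff down are assumptions.

```python
from datetime import date


def split_by_percentile(rows, column, percent):
    """Sort rows by column, then cut at the given percentile of the row count.

    Returns (first_part, remainder). The cutoff is rounded down, which is an
    assumption of this sketch.
    """
    ordered = sorted(rows, key=lambda row: row[column])
    cutoff = len(ordered) * percent // 100
    return ordered[:cutoff], ordered[cutoff:]


# Hypothetical sample data: ten transactions dated 2024-01-10 down to 2024-01-01.
transactions = [
    {"transaction_id": i, "date": date(2024, 1, 10 - i)} for i in range(10)
]

# Oldest 30% of transactions in one output, the rest in the other.
oldest, remaining = split_by_percentile(transactions, "date", 30)

print("oldest:", [row["date"].isoformat() for row in oldest])
print("remaining:", len(remaining), "rows")
```

Sorting in ascending order puts the oldest dates first, so the first slice holds the oldest transactions.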