Running Jobs with Partitioned Datasets

When we build a Flow with partitioned datasets, we decide what we want to build, and Dataiku DSS decides what to run to fulfill that request.

Partition Dependencies

To tell Dataiku DSS what we want to build, we configure two elements: the partition dependency function type and the target partition identifiers.

../../_images/partition-dependencies.png

Let’s look at a few ways we could configure partition dependencies.

Time-Based Dimension: Function Type “Equals”

In this example, our input dataset is partitioned by “Day” using the column purchase_date. Our goal is to use the Join recipe to build an output dataset made up of specific partitions.

../../_images/time-equals-example.png

Our partition dependency function type is “Equals” because we want Dataiku DSS to compute each target partition from the input partition with the same “Day” identifier value.

../../_images/dependency-function-type-equals.png

In addition, we want to build only specific partitions. Our target identifier is a comma-separated list, 2017-12-20,2017-12-21,2017-12-22,2017-12-23, identifying the “days” we want to build.

../../_images/example1-reciperun-options.png
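The same request can also be scripted through Dataiku’s public Python API. Below is a minimal sketch, assuming the dataikuapi package; the host, API key, project key, and dataset name are placeholders rather than values from this example.

    import dataikuapi

    # Hedged sketch: build four specific day partitions of the joined output.
    # The host, API key, project key, and dataset name are placeholders.
    client = dataikuapi.DSSClient("https://dss.example.com", "YOUR_API_KEY")
    dataset = client.get_project("MY_PROJECT").get_dataset("transactions_joined")

    # The partitions argument accepts the same comma-separated identifiers
    # we would type into the Recipe Run options.
    job = dataset.build(
        job_type="NON_RECURSIVE_FORCED_BUILD",
        partitions="2017-12-20,2017-12-21,2017-12-22,2017-12-23",
    )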

Time-Based Dimension: Function Type “Since the Beginning of the Month”

In this example, we are again using a Join recipe to build a dataset using the purchase_date dimension.

../../_images/time-since-example.png

This time, we want the output dataset to contain all the partitions “Since beginning of month.”

../../_images/dependency-function-type-since.png

In addition, we’ve targeted a specific date, 2018-04-30.

../../_images/example2-reciperun-options.png

We could also use a scenario to configure the target identifier using a keyword, such as “PREVIOUS_DAY.”

../../_images/build-previous-day.png
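The same keyword can be used from a custom Python step inside a scenario. A minimal sketch, assuming the dataiku.scenario API and a placeholder dataset name:

    from dataiku.scenario import Scenario

    # Hedged sketch of a custom Python scenario step. DSS resolves the
    # PREVIOUS_DAY keyword at run time, so each run targets yesterday's
    # partition. The dataset name is a placeholder.
    scenario = Scenario()
    scenario.build_dataset("transactions_joined", partitions="PREVIOUS_DAY")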

Discrete-Based Dimension: Function Type “Equals”

In this example, we are using the Window recipe to build a dataset using the merchant_subsector_description dimension.

../../_images/discrete-equals-example.png

Our partition dependency function type is “Equals” because we want Dataiku DSS to compute each target partition from the input partition with the same merchant_subsector_description identifier value.

../../_images/example3-dependency-function.png

In addition, we’ve targeted three specific partitions: gas/internet/insurance.

../../_images/example3-reciperun-options.png
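To request several discrete partitions programmatically, one option is the job builder in the public Python API. This is a sketch under the assumption that each partition identifier is passed as a separate output; credentials and object names are placeholders.

    import dataikuapi

    # Hedged sketch: request the three discrete partitions in a single job.
    # Host, API key, project key, and dataset name are placeholders.
    client = dataikuapi.DSSClient("https://dss.example.com", "YOUR_API_KEY")
    project = client.get_project("MY_PROJECT")

    job = project.new_job("NON_RECURSIVE_FORCED_BUILD")
    job.with_output("transactions_windowed", partition="gas")
    job.with_output("transactions_windowed", partition="internet")
    job.with_output("transactions_windowed", partition="insurance")
    job.start_and_wait()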

In these examples, we specified what we want to build by configuring the partition dependencies.

Next, let’s look at three examples that demonstrate how a partitioned job runs.

Job Behavior and Activities

Request One Partition from Three Input Partitions: Function Type “Equals”

In this example, our input dataset contains three partitions, and we are requesting one output partition. To build the requested partition, Dataiku DSS runs one activity. An activity is the run of a recipe on a specific partition, and a job can contain more than one activity.

../../_images/one-partition-equals.png

We’ve selected “Equals” as the partition dependency function type. The dimension is already defined because there is only one partition dimension in the input dataset to choose from. If our input dataset were partitioned by more than one dimension, Dataiku DSS would display more than one input identifier option.

../../_images/one-partition-equals-dependencies.png

In the Recipe Run options, we’ll specify the target identifier, 2020-11-04, before running the recipe.

Before Dataiku DSS can run the job, it does the following:

  • It starts with the dataset that needs to be built.

  • It then looks at which recipe is upstream to gather the partition dependency information.

../../_images/job-behavior1.png

Once Dataiku DSS computes the partition dependencies, it can then build the Flow. Since we requested to build one partition, there is one activity.

../../_images/job-activity1.png

Request Three Partitions from Three Input Partitions: Function Type “Equals”

If we were to request three partitions instead of one, using “Equals” with the same three input partitions, Dataiku DSS would run three activities, one per partition, to build three partitions.

../../_images/job-activity3.png
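Conceptually, “Equals” is a one-to-one mapping from requested partitions to input partitions, which is why the activity count matches the partition count. A short illustrative sketch in plain Python (an illustration of the idea, not DSS internals):

    # Illustrative sketch, not DSS internals: with "Equals", each requested
    # output partition depends on exactly one input partition with the same
    # identifier, so DSS schedules one activity per requested partition.
    def equals_dependencies(requested_partitions):
        return {day: [day] for day in requested_partitions}

    deps = equals_dependencies(["2020-11-02", "2020-11-03", "2020-11-04"])
    print(len(deps))  # 3 -> three activities, one per requested partition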

Request One Partition from Several Input Partitions: Function Type “Since Beginning of Month”

For an example with a different type of mapping, let’s request one partition using “Since beginning of month”. With 2020-11-04 as the requested partition, there are four input partitions in total (November 1 through November 4).

../../_images/one-partition-nov-1st.png

For this job, Dataiku DSS runs one activity to build one output partition that combines the four input partitions.

Even if there were 30 upstream partitions, DSS would still generate only a single downstream partition with this particular mapping.
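The arithmetic behind this many-to-one mapping is easy to sketch in plain Python (again, an illustration rather than DSS internals):

    from datetime import date, timedelta

    # Illustrative sketch, not DSS internals: a requested day partition
    # depends on every day partition from the 1st of its month up to and
    # including the requested day.
    def since_beginning_of_month(requested_day):
        d = date.fromisoformat(requested_day)
        first = date(d.year, d.month, 1)
        return [(first + timedelta(days=i)).isoformat() for i in range(d.day)]

    print(since_beginning_of_month("2020-11-04"))
    # ['2020-11-01', '2020-11-02', '2020-11-03', '2020-11-04']
    # Four input partitions, still one activity and one output partition.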

Building a Flow with Multiple Partitioned Datasets

Let’s consider how Dataiku DSS might build a Flow where we have four datasets and three recipes.

../../_images/build-flow-multiple-partitions.png

Each dataset in our Flow is partitioned by “Day”. In this diagram, recipes are represented by letters of the alphabet while datasets are numbered. The goal is to build the partition “2020-01-03” in the last dataset in the Flow.

To do this, Dataiku DSS starts from the last dataset and works backwards to compute the partition dependencies. For example, since recipe “f” is upstream of the final partition to be built, its partition dependencies are computed first.

Once all the partition dependencies are computed, the job activities run in a “forward” direction through the Flow. Starting with the second dataset, the first job activity builds the partitions needed for the requested partition, producing output that is needed downstream. Subsequent activities build the partitions of the third dataset that recipe “f” takes as input, and a final activity builds the requested partition, 2020-01-03, as the output of recipe “f”.
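Requesting such a build programmatically is a one-liner once the dependencies are configured. A hedged sketch with placeholder names, using a recursive build so that DSS computes the upstream partition dependencies itself:

    import dataikuapi

    # Hedged sketch: ask for partition 2020-01-03 of the last dataset in the
    # Flow. With a recursive build, DSS walks backwards to compute partition
    # dependencies, then runs the activities forward. Names are placeholders.
    client = dataikuapi.DSSClient("https://dss.example.com", "YOUR_API_KEY")
    dataset = client.get_project("MY_PROJECT").get_dataset("final_dataset")
    job = dataset.build(job_type="RECURSIVE_BUILD", partitions="2020-01-03")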

Exploring a Flow with Multiple Partitioned Datasets

Build Configuration

When we select to build a dataset in the Flow or run a recipe, Dataiku DSS displays the “build configuration”. Whether we choose a non-recursive or a recursive job, the behavior is the same: Dataiku DSS asks us to select the partition or partitions we want to work with. This is similar to the option displayed in a visual recipe’s configuration window.

../../_images/build-transactions-joined-recursive.png

When we view a partitioned job, we can see its activities. Each activity tells us which partition it built.

../../_images/job-4-activities.png

Build a Partitioned Model

When building a machine learning model, we can use the Target panel of the Design pane to tell Dataiku DSS to use a partitioned dataset. Using this feature, we can train our machine learning models on a specific partition of the dataset.

../../_images/partitioned-models-target-panel.png

In addition, using the Train/Test Set panel of the Design pane, we can choose the sample on which the train/test splitting is performed, deciding whether to use all partitions or a specific partition.

What’s next?

Visit the Dataiku DSS reference documentation to find out more about working with partitions, including partition identifiers and partition dependencies. For examples of ways to use keywords, visit Variables in scenarios.