Concept | Partitioned models#


In Dataiku, we can take our partitioned dataset and train a prediction model on each partition.

Why use partitioned models?#

Partitioned or stratified models may result in better predictions than a model trained on the whole dataset.

../../_images/partitioned-summary01.png

This is because the subgroups corresponding to a dataset’s partitions can behave quite differently, and therefore exhibit different patterns across the features.

Take country as an example of a subgroup: customers in different countries could have different purchasing patterns that impact sales predictions, due to differences in characteristics such as seasonality.

../../_images/partitioned-summary02.png

Therefore, partitioning our data by country and training a machine learning model on each partition could result in higher-performing models. If we instead trained a single model on the whole dataset, it might not capture all of the nuances of each country or subgroup.
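
To make this concrete, here is a minimal sketch outside of Dataiku, using scikit-learn on made-up sales data (all column names and values below are hypothetical). It trains one global random forest on the whole dataset, then one model per country partition, and compares R2 scores. Dataiku’s partitioned models do this per-partition training for us natively.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Made-up sales data in which the price/sales relationship differs by country.
n = 400
df = pd.DataFrame({
    "country": rng.choice(["US", "FR"], size=n),
    "price": rng.uniform(5, 50, size=n),
})
slope = np.where(df["country"] == "US", -2.0, -0.5)
df["units_sold"] = 100 + slope * df["price"] + rng.normal(scale=5, size=n)

# One global model trained on the whole dataset (price only).
X_tr, X_te, y_tr, y_te = train_test_split(
    df[["price"]], df["units_sold"], random_state=0
)
global_model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)
print("global R2:", round(r2_score(y_te, global_model.predict(X_te)), 3))

# One model per partition, each trained only on its own country's rows.
for country, group in df.groupby("country"):
    X_tr, X_te, y_tr, y_te = train_test_split(
        group[["price"]], group["units_sold"], random_state=0
    )
    model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)
    print(f"{country} R2:", round(r2_score(y_te, model.predict(X_te)), 3))
```

On data like this, where each partition follows its own trend, the per-partition scores usually come out ahead of the global one.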

Use case#

Let’s test our hypothesis in Dataiku. In this example, we’ve trained and deployed two machine learning models to predict flight arrival delays. The flight arrival dataset has two partitions: Florida and California.

  • Random_forest has been trained on the whole dataset.

  • random_forest_partitioned has been trained on the partitioned dataset.

../../_images/partitioned-summary1.png

Let’s compare our two deployed models and observe the results.

We’ll look at the non-partitioned model first. In the performance summary, it looks like our best model is the random forest. Its R2 score is pretty good.

../../_images/partitioned-summary2.png

Let’s compare this with the results of the partitioned model. Once again, our best model is the random forest.

../../_images/partitioned-summary3.png

We notice that Dataiku does not display the performance summary at the top. This is expected: rather than a single model, we have trained as many models as there are partitions in our dataset, times the number of algorithms. A summary covering all of these models would not be readable.

Furthermore, Dataiku displays the R2 score as approximate because it is an aggregation of the results of two separate models: one trained on the florida partition, and one trained on the california partition. It’s good to see that the overall score of the partitioned model exceeded that of the non-partitioned model.
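
To illustrate how such an overall figure can be aggregated, here is a hedged sketch on simulated holdout data: each partition’s model is scored on its own rows, and pooling all rows yields a single combined R2. Everything below (partition sizes, delay values, noise levels) is made up, and Dataiku’s exact aggregation may differ, which is why it flags the displayed score as approximate.

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

def fake_holdout(n):
    """Simulate holdout labels and one model's predictions for a partition."""
    y_true = rng.normal(loc=30, scale=10, size=n)   # arrival delays (minutes)
    y_pred = y_true + rng.normal(scale=4, size=n)   # imperfect predictions
    return y_true, y_pred

# Hypothetical holdout results for each partition's model.
partitions = {"florida": fake_holdout(120), "california": fake_holdout(200)}

# Each partition's model is evaluated only on its own partition's rows.
for name, (y_true, y_pred) in partitions.items():
    print(f"{name}: R2 = {r2_score(y_true, y_pred):.3f}")

# One way to report an overall figure: pool every partition's holdout rows
# and compute a single R2 over the combined predictions.
y_true_all = np.concatenate([y for y, _ in partitions.values()])
y_pred_all = np.concatenate([p for _, p in partitions.values()])
print(f"overall: R2 = {r2_score(y_true_all, y_pred_all):.3f}")
```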

To compare the overall results with the results of the partitions, we can review the Summary page of the random forest model.

../../_images/partitioned-summary4.png

We can also examine results by partition in a partitioned model. These results are more detailed than the overall results because each per-partition model is an expert only on the partition it learned from.

../../_images/partitioned-summary5.png
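
Conceptually, scoring with a partitioned model means routing each record to the model of its own partition. The sketch below mimics that dispatch with two hypothetical constant-prediction stand-ins for the Florida and California models; the names and numbers are invented, and the deployed partitioned model handles this routing for us in Dataiku.

```python
import pandas as pd
from sklearn.dummy import DummyRegressor

# Hypothetical stand-ins for the two per-partition models. Each one has only
# ever seen flights from its own state, so it should only score that state.
models = {
    "florida": DummyRegressor(strategy="constant", constant=35.0).fit([[0]], [0]),
    "california": DummyRegressor(strategy="constant", constant=12.0).fit([[0]], [0]),
}

def score(flights, feature_cols):
    """Route each row to the model trained on its own partition."""
    preds = pd.Series(index=flights.index, dtype=float)
    for state, group in flights.groupby("state"):
        preds[group.index] = models[state].predict(group[feature_cols])
    return preds

new_flights = pd.DataFrame(
    {"state": ["florida", "california", "florida"], "distance": [500, 900, 230]}
)
print(score(new_flights, ["distance"]))
```

We never write this dispatch ourselves in Dataiku; it simply explains why each per-partition score in the summary reflects only that partition’s rows.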