Creating the Training & Scoring Datasets¶

To train models, we’ll use the Split recipe to create two separate datasets from the merged dataset, data_by_Asset:

a training dataset will contain labels for whether or not there was a failure event on an asset. We’ll use it to train a predictive model.
a scoring dataset will contain no data on failures, i.e., unlabelled, so we’ll use it to predict whether or not these assets have a high probability of failure.

Here are the detailed steps:

From the data_by_Asset dataset, initiate a Split recipe.
Add two output datasets, named training and scoring, selecting Create Dataset each time. Then Create Recipe.
At the Splitting step, choose to Map values of a single column. Then choose failure_bin as the column on which to split.
Assign values of 0 and 1 to the training set, and all “Other values” to the scoring set. (From the Analyze tool, we can see that these are the only possible values).
Run the recipe.

Note

In this example data, it is unclear exactly why the unlabelled data used for the scoring data are missing. Are they missing at random? Or do they represent a population that is different in some meaningful way from the labelled data now in the training dataset? This is an important data science question, but not one we can answer without knowing more about the source of the data.

../../../_images/split_maintenance_failure_copy_joined.png

Although other kinds of splits and options within Settings could produce the same outputs, hopefully this combination is the simplest. Now that we have our training dataset, we can move to the model.