Concept | Predictive modeling#

The motivation for predictive modeling#

With the availability of quick modeling tools, it’s easy to jump in and start building machine learning models. However, they’re a few key concepts that will better prepare you for building a high performance model.

To understand how prediction works, we’ll take a closer look at the concept of labeled and unlabeled data that was introduced in the Concept | Introduction to machine learning lesson. Then, we’ll touch on how labeled data is used in supervised machine learning.

Labeled versus unlabeled data#

Let’s look at the difference between labeled and unlabeled data, using an example, where:

The labeled dataset contains patient information that has a class label representing an outcome, sick or not sick, where 1 means the patient is sick.
The unlabeled dataset contains customer information but doesn’t have a class label column representing an outcome.

Labeled data (Supervised learning)#

In this example, the column, Sick, is the class label column. Using this column, along with patient characteristics (also known as variables or features) such as age gender, and smoking, we can teach the machine the relationship between the features and the resulting outcome.

This approach falls under the category of prediction, also referred to as supervised learning.

Unlabeled data (Unsupervised learning)#

By contrast, the unlabeled dataset contains information about customers including which products they have purchased. Here, there is no class label, but we can still use the information to teach the machine. The machine can learn about these customers by finding patterns and similarities in the data set.

This is a machine learning method known as Clustering, and it falls under the category of unsupervised learning.

In the section on clustering, we will discuss this type of dataset in more depth.

Classification vs. regression#

Prediction problems can be either classification or regression.

In a classification problem, prediction means that we use some training data to understand how input variables (such as attributes of a patient) relate to the output. This output could be a class label, such as Sick or Not Sick. When they are only two class labels, it’s referred to as binary classification. The model is trained to understand the relationship between the input variables and the output variable (or class label).

In a regression problem, rather than predicting a categorical, or discrete, value, we predict a continuous, quantitative value.

To further explain the distinction between classification and regression, consider that we want to understand how the number of sleep hours and study hours (the predictors) determine if a student will Succeed or Fail, where Succeed and Fail are two class labels. This is a classification problem.

If we use those same predictors to determine the test scores of students, then we have a regression problem. A good way to remember this distinction is that:

Classification is about predicting a category or class label.
Regression is about predicting a quantity.

Next steps#

In this lesson, we learned about the basic concepts around prediction including labeled and unlabeled data, and how we can use labeled data in supervised machine learning.

In upcoming lessons, you will get a chance to dive into the concepts of classification, regression, and clustering more closely.