Concept | Principal Component Analysis (PCA)

PCA

PCA is useful for representing and visualizing data in a lower-dimensional space of uncorrelated variables, called principal components, that capture as much of the variance in the data as possible. When data is expressed in PCA dimensions, the largest variance lies along the first principal component, the next largest along the second principal component, and so on.

Graphical representation of principal component analysis (PCA).
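
To make these two properties concrete, here is a minimal sketch that computes principal components with NumPy by eigendecomposing the covariance matrix. The toy data and all variable names are assumptions made for illustration, and this is only one common way to compute PCA, not necessarily how any particular tool implements it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 rows, 3 numerical features.
# The first two features are strongly correlated; the third is independent.
latent = rng.normal(size=(200, 1))
X = np.hstack([
    latent + 0.1 * rng.normal(size=(200, 1)),
    2.0 * latent + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Center the data, then eigendecompose its covariance matrix.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; reverse so PC1 comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Express the data in PCA dimensions.
scores = X_centered @ eigenvectors

# The covariance of the scores is diagonal (components are uncorrelated),
# and the diagonal holds the eigenvalues in decreasing order.
score_cov = np.cov(scores, rowvar=False)
print(np.round(eigenvalues, 3))
print(np.round(score_cov, 3))
```

The off-diagonal entries of `score_cov` should be zero up to floating-point error, and its diagonal should match `eigenvalues`.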

These attributes of PCA also make it useful for data pre-processing (or feature processing) prior to model building, because reducing the number of features in a dataset can speed up training and, in some cases, improve model performance.

Principal Component Analysis card

The PCA card displays a scree plot, a 2-dimensional scatter plot, a loading plot, and a heatmap.

The scree plot displays eigenvalues and their corresponding principal components. The curved line across the plot shows how the cumulative explained variance of the data increases with the number of principal components. Keeping all the principal components retains 100% of the variance in the data.

For dimensionality reduction applications, we set a cut-off value for the cumulative explained variance, such as 90%, and then represent the data using the smallest number of principal components needed to reach this cut-off.

Scree plot of eigenvalues and corresponding principal components.
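
As a rough illustration of the cut-off rule, the sketch below turns a list of eigenvalues into cumulative explained variance and picks the smallest number of components that reaches the cut-off. The function name `n_components_for_cutoff` and the eigenvalues are made up for this example.

```python
import numpy as np

def n_components_for_cutoff(eigenvalues, cutoff=0.90):
    """Return (k, cumulative): k is the smallest number of components whose
    cumulative explained variance ratio reaches `cutoff`."""
    # Sort in descending order so the first component explains the most variance.
    eigenvalues = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    ratios = eigenvalues / eigenvalues.sum()
    cumulative = np.cumsum(ratios)
    # Index of the first cumulative value >= cutoff, converted to a count.
    k = int(np.searchsorted(cumulative, cutoff)) + 1
    # Guard against floating-point round-off when cutoff is 1.0.
    k = min(k, len(eigenvalues))
    return k, cumulative

# Made-up eigenvalues, as they might be read off a scree plot.
eigenvalues = [5.0, 3.0, 1.5, 0.5]
k, cumulative = n_components_for_cutoff(eigenvalues, cutoff=0.90)
print(k, np.round(cumulative, 2))
```

With these values, the first two components explain 80% of the variance and the first three explain 95%, so a 90% cut-off keeps three components.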

The 2-dimensional scatter plot represents the dataset in the space of the first two principal components. Notice that the variation is largest in the direction of the first principal component.

2D scatter plot.
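
The coordinates behind such a scatter plot can be sketched as follows. This example assumes NumPy and uses a singular value decomposition of the centered data, which yields the same principal directions as the covariance eigenvectors; the data is randomly generated for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: 100 rows, 4 features with different spreads.
X = rng.normal(size=(100, 4)) * np.array([3.0, 1.0, 0.5, 0.2])
X_centered = X - X.mean(axis=0)

# Rows of Vt are the principal directions, ordered by decreasing singular value.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Coordinates of every row on the first two principal components.
coords_2d = X_centered @ Vt[:2].T

# Spread along each plotted axis; PC1 should have the larger variance.
var_pc1, var_pc2 = coords_2d.var(axis=0, ddof=1)
print(coords_2d.shape, round(var_pc1, 3), round(var_pc2, 3))
```

Plotting `coords_2d[:, 0]` against `coords_2d[:, 1]` with any charting library would give a scatter plot of this kind.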

The loading plot shows how strongly each numerical variable influences a principal component. The closer a loading is to 0, the weaker the influence of that variable on the component.

Loading plot showing how strongly each numerical variable influences one of the first two principal components.
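
To illustrate how loadings can be read, the sketch below computes loading vectors for a small synthetic dataset and prints the loadings on the first principal component. The feature names are hypothetical, loadings are taken here to be the unit-length eigenvectors of the covariance matrix of standardized data, and whether a given tool standardizes or scales loadings differently is an assumption you should check.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

# Hypothetical features: "height" and "weight" share a common signal,
# while "noise" is unrelated to both.
feature_names = ["height", "weight", "noise"]
base = rng.normal(size=n)
X = np.column_stack([
    base + 0.1 * rng.normal(size=n),
    base + 0.1 * rng.normal(size=n),
    rng.normal(size=n),
])

# Standardize each feature (an assumption of this sketch).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
loadings = eigenvectors[:, order]  # column j holds the loadings of component j + 1

# Loadings on the first principal component, one per variable.
pc1 = dict(zip(feature_names, loadings[:, 0]))
for name, value in pc1.items():
    print(f"{name}: {value:+.3f}")
```

Here the two correlated features should carry loadings of similar magnitude on the first component, while the unrelated feature's loading stays close to 0. The overall sign of a loading vector is arbitrary, so compare magnitudes rather than signs.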

The heatmap shows the principal component loading vectors, which define the linear transformation of the data from its original feature space to the reduced PCA space.

Heat map of principal component loading vectors.
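
The linear transformation itself can be sketched as a matrix product between the centered data and a matrix of loading vectors. The example below assumes NumPy, random illustrative data, and hypothetical names (`W`, `W_k`, `Z`); it shows the change of shape when only the first `k` loading vectors are kept.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical data: 50 rows in an original space of 5 features.
X = rng.normal(size=(50, 5))
mean = X.mean(axis=0)
X_centered = X - mean

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
W = eigenvectors[:, order]  # shape (5, 5): one loading vector per column

# Keep the first k loading vectors to reduce the dimension from 5 to k.
k = 2
W_k = W[:, :k]              # shape (5, 2)
Z = X_centered @ W_k        # shape (50, 2): data in the reduced PCA space

# With all components kept, the transformation is an orthogonal change of
# basis, so the original data can be recovered exactly.
Z_full = X_centered @ W
X_back = Z_full @ W.T + mean
print(Z.shape, np.allclose(X_back, X))
```

A heatmap of `W` (variables as rows, components as columns) would display the same numbers used in this product.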