PRINCIPAL COMPONENT ANALYSIS (PCA) - A BRIEF UNDERSTANDING

Principal Component Analysis (PCA) is an unsupervised technique for reducing the dimensionality of data. The idea behind PCA is to seek the most accurate representation of the data in a lower-dimensional space. In other words, it finds new axes such that moving along them corresponds to moving along the directions in which the data actually varies.

Ideally, all the uncertainty and variation present in the high-dimensional data should be captured in the lower-dimensional representation. In many cases this is not entirely possible, so the representation cannot retain every direction of variation in the data. PCA therefore tries to preserve the directions with the largest variances.

Consider the linear combination of the variables:

C = w1 * y1 + w2 * y2 + w3 * y3

Where

C: Consolidated representation of the features (Components)

w1, w2, w3: PCA component loadings

y1, y2, y3: Scaled features

Most of the variability in the data is captured by PC1 (Principal Component 1), and the residual variability is captured by PC2 (Principal Component 2), which is orthogonal (uncorrelated) to PC1. PC2 captures the variation that PC1 leaves unexplained. PC1 and PC2 have zero correlation.
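As a quick, non-authoritative illustration of these two points, the Python sketch below builds a small made-up dataset (the features y1, y2, y3 and all numbers are assumptions for this example), scales it, fits a two-component PCA with scikit-learn, and checks that the PC1 and PC2 scores have essentially zero correlation:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Toy dataset with three correlated features (y1, y2, y3) -- assumed for illustration.
rng = np.random.default_rng(0)
y1 = rng.normal(size=200)
y2 = 0.8 * y1 + rng.normal(scale=0.3, size=200)
y3 = -0.5 * y1 + rng.normal(scale=0.5, size=200)
X = np.column_stack([y1, y2, y3])

# Scale the features, then express each component as a weighted sum of them.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
scores = pca.fit_transform(X_scaled)        # columns are the PC1 and PC2 scores
loadings = pca.components_                  # rows hold the weights w1, w2, w3 per component

print("Loadings (w1, w2, w3) for PC1 and PC2:\n", loadings)
print("Correlation between PC1 and PC2:", np.corrcoef(scores[:, 0], scores[:, 1])[0, 1])
print("Variance explained by PC1 and PC2:", pca.explained_variance_ratio_)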

When do we apply PCA?

1.       To reduce the dimensions or features of the data.

2.       Pattern recognition based on the features of the data.

3.       To resolve the multicollinearity issue.

When the independent variables are highly correlated with each other, the regression coefficients lose their stability and interpretability; this is the multicollinearity issue.

Consider the following equation:

Y = β0 + β1 * PC1 + β2 * PC2

Where PC1 and PC2 are uncorrelated by construction. Therefore, PCA resolves the multicollinearity problem by creating new features that carry no linear dependence on each other.
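A minimal sketch of this idea, assuming a toy regression problem with two almost identical predictors (x1, x2 and the coefficients used to generate y are illustrative assumptions, not from the article): the raw features are highly correlated, the principal component scores are not, and the regression Y = β0 + β1 * PC1 + β2 * PC2 is fitted on the uncorrelated scores:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)          # x2 is almost identical to x1 -> multicollinearity
y = 3.0 * x1 + 2.0 * x2 + rng.normal(scale=0.5, size=n)

X = np.column_stack([x1, x2])
print("Correlation between x1 and x2:", np.corrcoef(x1, x2)[0, 1])

# Replace the correlated predictors with uncorrelated principal component scores.
pcs = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
print("Correlation between PC1 and PC2:", np.corrcoef(pcs[:, 0], pcs[:, 1])[0, 1])

# Y = b0 + b1 * PC1 + b2 * PC2 -- coefficients are now estimated on uncorrelated inputs.
model = LinearRegression().fit(pcs, y)
print("Intercept b0:", model.intercept_, "Coefficients b1, b2:", model.coef_)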

Steps for Performing PCA:

1.       Standardize the data, i.e., center it on the origin (and typically scale each feature to unit variance).

2.       Generate the covariance/correlation matrix for all the dimensions.

Covariance/Correlation matrix captures the variation between the different variables in the original dimensions.

3.       Decompose the covariance/correlation matrix to obtain new coordinate axes that rotate the dataset so that the rotated version captures most of the variability. These rotated axes are the eigenvectors, and the corresponding eigenvalues give the magnitude of the variance captured along each axis.

4.       Sort the eigenpairs in descending order of the eigenvalues and select the pair with the largest eigenvalue. This is PC1, which carries the maximum information from the original data.

5.       Finally, decide how many PCs are useful with a scree plot. The more PCs we keep, the more variance is explained; the fewer we keep, the greater the dimensionality reduction and compression of the data. A minimal NumPy sketch of these steps is given below.
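The following NumPy sketch walks through the five steps above on an assumed toy matrix X (the data and names are illustrative only): standardize, build the covariance/correlation matrix, eigendecompose it, sort the eigenpairs, and read off the explained-variance ratios that a scree plot would display:

import numpy as np

# Assumed toy data: 100 samples, 3 features (illustrative only).
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
X[:, 1] += 0.9 * X[:, 0]                       # introduce some correlation between features

# Step 1: standardize (center on the origin and scale to unit variance).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized data (equals the correlation matrix).
cov = np.cov(X_std, rowvar=False)

# Step 3: eigendecomposition -- eigenvectors are the rotated axes,
# eigenvalues are the variance captured along each axis.
eig_vals, eig_vecs = np.linalg.eigh(cov)

# Step 4: sort eigenpairs in descending order of eigenvalue; the first one is PC1.
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# Step 5: explained-variance ratios -- the values a scree plot would show.
explained = eig_vals / eig_vals.sum()
print("Explained variance ratio per PC:", explained)

# Project the data onto the first two principal components.
scores = X_std @ eig_vecs[:, :2]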

Signal to Noise Ratio:

The variation along the principal direction, i.e., the variability we are trying to capture, is the signal; everything around this signal is noise. The noise represents aspects of the data that the signal is unable to pick up: from PC1's point of view it is just a random factor, but it is treated as signal by PC2.

In other words, PCA is a sequential way of extracting signals from the data: as we keep separating signal from noise, we extract one principal component after another. The more signal extracted, the better the PCA performs.

The quality of signal extraction is measured by the Signal-to-Noise Ratio (SNR), the ratio of the variance in the signal to the variance in the noise.

A greater SNR implies that PCA can extract the signal from the data with fewer dimensions.
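As a rough sketch of this ratio, under the assumption that the "signal" is the variance of the projections onto PC1 and the "noise" is the variance of whatever remains around that direction (an interpretation chosen here just for illustration), one possible computation on a toy 2-D dataset is:

import numpy as np

rng = np.random.default_rng(7)
# Toy 2-D data: strong variation along one direction plus some scatter around it.
t = rng.normal(size=500)
X = np.column_stack([t, 0.6 * t]) + rng.normal(scale=0.2, size=(500, 2))

Xc = X - X.mean(axis=0)                         # center the data
eig_vals, eig_vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
pc1 = eig_vecs[:, np.argmax(eig_vals)]          # direction of largest variance

signal = Xc @ pc1                               # projections onto PC1 (the "signal")
noise = Xc - np.outer(signal, pc1)              # what remains around PC1 (the "noise")

snr = signal.var() / noise.var()
print("Signal-to-noise ratio along PC1:", snr)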

Improving SNR through PCA:

It is important that we center the data, i.e., the mean is subtracted from every point along both dimensions: (xi – x̅) and (yi – ȳ). This moves the center of the data from (x̅, ȳ) to (0, 0), so the origin of the coordinate system coincides with the center of the data. Now, even as we rotate the coordinate system, the center does not change and remains at (0, 0). ‘Centring’ is therefore crucial: the rotation does not distort the values themselves, and it allows us to capture the variation while reducing the total error of representation. This is what maximizes the SNR.
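A tiny sketch of the centering step on assumed toy points: after subtracting the means, the center of the data sits at (0, 0), and a rotation about the origin leaves that center unchanged:

import numpy as np

# Assumed toy points (x, y) for illustration.
pts = np.array([[2.0, 5.0], [4.0, 7.0], [6.0, 9.0], [8.0, 11.0]])

centered = pts - pts.mean(axis=0)               # (xi - x_bar, yi - y_bar)
print("Center after centering:", centered.mean(axis=0))    # ~ (0, 0)

# Rotate the centered data by 30 degrees about the origin: the center stays at (0, 0).
theta = np.deg2rad(30)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
rotated = centered @ R.T
print("Center after rotation:", rotated.mean(axis=0))       # still ~ (0, 0)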

Performance Issues of PCA:

1.       PCA’s effectiveness depends on the scales of the attributes. If the attributes under consideration are on different scales, PCA will favour the variables with the highest variance rather than selecting attributes based on correlation.

2.       Changing the scales of the variables changes the PCA results.

3.       Interpreting PCA can become challenging in the presence of discrete data, because scaling discrete data is difficult.

4.       Skewness in the data, especially long, heavy tails, can impact the effectiveness of PCA. Variance is the squared standard deviation and is a symmetric measure of spread, whereas a skewed distribution is not symmetric. Hence skewness distorts the notion of variance and therefore the results of PCA.

5.       PCA, in general, assumes linear relationships between attributes and is ineffective when the relationships are non-linear. There are versions of PCA, such as Kernel PCA, that can capture non-linear relationships, but standard PCA cannot; a brief sketch follows below.
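Since non-linear variants exist, here is a hedged sketch using scikit-learn's KernelPCA on an assumed toy dataset of two concentric circles (the dataset and the RBF kernel parameters are illustrative choices): standard PCA cannot separate the two rings along its first component, while the kernel version typically can:

import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Toy non-linear structure: two concentric circles (assumed for illustration).
X, labels = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

linear_scores = PCA(n_components=2).fit_transform(X)
kernel_scores = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

# With the RBF kernel, PC1 typically separates the inner and outer circles;
# the linear projection does not (both class means stay near zero).
for name, scores in [("linear PCA", linear_scores), ("kernel PCA", kernel_scores)]:
    inner = scores[labels == 1, 0].mean()
    outer = scores[labels == 0, 0].mean()
    print(f"{name}: mean PC1 score, inner circle = {inner:.3f}, outer circle = {outer:.3f}")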

Written By: Srinidhi Devan

Reviewed By: Viswanadh

