
Why do we need PCA?

In this section, we explain PCA qualitatively. By the end of this section, you will understand the qualitative meaning of the following terms: data visualization, feature extraction, dimensional reduction, orthogonal, orthonormal, and correlation. The qualitative explanation gives you the basic idea of PCA. As you read the later sections, you will learn the more precise meaning and the step-by-step algorithm to compute PCA. Before you read further, keep in mind that the following words have similar meanings: variables, fields, features, factors, and components, which basically refer to the columns of the data table. The transformed variables after PCA are called factors or components. The following words also have similar meanings: records, objects, and points, which basically refer to the rows of the data table.

One of the most important goals of data science is to create a model that explains a given phenomenon or describes the behavior of a system. We observe the phenomenon or the system and take some measurements. Since we do not know in advance which measurements will be useful for modeling the system, we usually attempt to measure everything we can from the target system or phenomenon. Our hope is that the more observations we have, the more accurate the result of our analysis will be.

Suppose now that you have collected your data, consisting of thousands or millions of observations and hundreds of variables. However, you feel overwhelmed by the number of variables in your data. You can draw two or three variables in a single chart, but surely you cannot draw 100+ variables in a single chart. Drawing hundreds of charts does not help either, because it is difficult to comprehend them or to find the visual pattern of association among these variables. Does this sound like a familiar problem?

Even if you do not intend to visualize your data, having a smaller number of variables would make your job as a data scientist easier. If you plot every two or three variables in a single chart, ten variables produce far fewer charts to analyze than hundreds of variables. Now you ask yourself: is it possible to represent almost the whole data that you have collected with only a few variables? Which variables in your data actually contribute the most information? How much does each variable contribute, as a percentage, to the total variation? Can we identify the underlying common factors? These questions fall within the field called Feature Extraction and Dimensional Reduction. Let me explain what these two terms mean.

When we say feature extraction, we mean that we want to obtain features (i.e. variables) that contain most of the information content in the data. It turns out that most of the information content is related to the variation in the data. According to Claude Shannon, the father of information theory, information represents the level of "surprise" of a particular outcome. A highly improbable outcome is very surprising. Suppose your data contain a single repeated number such as 2, 2, 2, ..., 2, 2. There is no variation in the data and therefore nothing surprising. When your data is purely random, it contains higher variation but is not so useful either, because the randomness may come from noise rather than from a real signal. In short, information theory tells us that the higher the variation, the higher the information content. PCA is a mathematical technique that transforms the original dataset into a new dataset in which the variation is maximized (i.e. the information content is the highest) and, at the same time, the redundant information caused by strong correlation among the variables is removed. By extracting the maximum variance in the data and removing the correlation at the same time, we extract the measurements that are invariant and insensitive to variation within each class of data. The process of extracting or selecting such information is called feature extraction or feature selection. In using feature extraction techniques, you want to balance simplicity and completeness. The PCA model is simple because it is a linear model. You can make it complete by keeping all the components (PCA without dimensional reduction), or you can simplify it further by reducing the dimension.
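To make the idea of "surprise" concrete, here is the usual textbook form of Shannon's measure. The symbols $$I$$ and $$p$$ are notation introduced only for this illustration. The surprise of an outcome $$x$$ that occurs with probability $$p(x)$$ is

$$I(x) = -\log_{2} p(x)$$

measured in bits. For the constant data 2, 2, ..., 2, the only possible outcome has $$p(x)=1$$, so $$I(x) = -\log_{2} 1 = 0$$: observing it tells us nothing new. An outcome with $$p(x)=1/8$$ gives $$I(x)=3$$ bits, which is more surprising.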

The word dimension means the number of variables. Dimensional reduction means that we transform the original dataset into a new dataset whose number of variables is much smaller than the number of variables in the original dataset. We use the name “variables” for the features in your original data and “components” for the features after the PCA transformation. Why do we want to reduce the dimension? There are two main reasons. First, if you can reduce the number of dimensions (i.e. the number of variables) from hundreds to just two or three, then you can draw a map or chart that represents the variation of the whole data. Second, it turns out that the first few components in PCA are the most important ones and capture the most salient structure of the system or the phenomenon, while the last few components often contain mostly noise. The more dimensions (i.e. variables) you have, the more complete your information. However, the data also become more difficult to visualize and to analyze. Higher dimension makes it more difficult to find the interesting patterns within your data. When you reduce the number of dimensions, you simplify the process of finding the pattern of association, and you may be able to visualize the pattern. The simplification captures the majority of the information but, at the same time, it also discards some of the information in the data.
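One common way to express how much each component captures is its share of the total variance. As a sketch, using notation assumed only for this illustration ($$c_{k}$$ is the $$k$$-th component and $$n$$ is the total number of components):

$$\text{share}_{k} = \frac{\mathrm{Var}(c_{k})}{\sum_{j=1}^{n} \mathrm{Var}(c_{j})} \times 100\%$$

Keeping only the first $$m < n$$ components then retains $$\sum_{k=1}^{m} \text{share}_{k}$$ of the total variation and discards the rest.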

It turns out that PCA does more than that. PCA is also useful for preprocessing our data. First, PCA produces a new dataset whose variables carry the maximum information content of the original dataset. Second, PCA has the ability to reduce the dimension. Third, PCA produces new components whose axes are orthonormal to each other. What does it mean to be orthonormal? The word orthonormal means that the axes of the newly transformed variables (i.e. the components) are orthogonal to each other and that the length of each basis vector is one. This is amazing because PCA chooses these orthogonal axes in such a way that the components are not correlated with each other. Note that the unit length belongs to the basis vectors and not to the data: the standard deviation of each component is generally not one, but you can standardize it by dividing each component by its own standard deviation. Thus, PCA almost magically creates new variables that are uncorrelated with each other and that are easy to standardize.
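In symbols, and assuming we write the basis vector of the $$i$$-th component as $$w_{i}$$ (a notation introduced only for this illustration), orthonormal means

$$w_{i}^{T} w_{j} = 0 \text{ for } i \neq j, \qquad w_{i}^{T} w_{i} = 1$$

For example, in two dimensions the vectors $$w_{1} = (\cos\theta, \sin\theta)$$ and $$w_{2} = (-\sin\theta, \cos\theta)$$ are orthonormal for any angle $$\theta$$, since $$w_{1}^{T} w_{2} = -\cos\theta\sin\theta + \sin\theta\cos\theta = 0$$ and $$\cos^{2}\theta + \sin^{2}\theta = 1$$.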

You may then ask why we need uncorrelated data. When two variables $$x_{1}$$ and $$x_{2}$$ are highly correlated (say, the absolute correlation between $$x_{1}$$ and $$x_{2}$$ is higher than 0.8), we can approximately say that $$x_{2}$$ can be replaced by $$x_{2}=a x_{1} + b$$, where $$a$$ and $$b$$ are constants. When $$x_{1}$$ and $$x_{2}$$ are positively correlated, constant $$a$$ is positive. When $$x_{1}$$ and $$x_{2}$$ are negatively correlated, constant $$a$$ is negative. Thus, we can say that two highly correlated variables contain nearly the same information. If we put them together in a model (such as a regression model), the redundant information can mislead us. In multiple linear regression analysis, for instance, putting correlated variables into the same model (such as $$y=a x_{1} + b x_{2} + c$$ where $$x_{1}$$ and $$x_{2}$$ are highly correlated) never lowers $$R^{2}$$ and usually raises it, but the increase does not reflect genuinely new information, and the estimated coefficients become unstable and hard to interpret. (Now you know how to cheat using statistics to increase the $$R^{2}$$ of your linear regression in the eyes of those who do not really understand statistics.) Thus, PCA is a useful preprocessing step before performing regression analysis.
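The link between high correlation and the approximation $$x_{2}=a x_{1} + b$$ can be checked numerically. The following is a minimal sketch in plain Python, not a definitive implementation: the helper names `pearson` and `fit_line` are hypothetical, and the data are made up for illustration only.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient of two equally long lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((p - mu) * (q - mv) for p, q in zip(u, v))
    suu = sum((p - mu) ** 2 for p in u)
    svv = sum((q - mv) ** 2 for q in v)
    return suv / math.sqrt(suu * svv)

def fit_line(u, v):
    """Least-squares constants (a, b) such that v is approximately a*u + b."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((p - mu) * (q - mv) for p, q in zip(u, v))
    suu = sum((p - mu) ** 2 for p in u)
    a = suv / suu
    b = mv - a * mu
    return a, b

# Made-up data: x2 is roughly twice x1, x3 is the mirror image of x2.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
x3 = [-value for value in x2]

r_pos = pearson(x1, x2)          # strongly positive correlation
a_pos, b_pos = fit_line(x1, x2)  # so the constant a is positive

r_neg = pearson(x1, x3)          # strongly negative correlation
a_neg, b_neg = fit_line(x1, x3)  # so the constant a is negative

print("positive case: r =", round(r_pos, 3), " a =", round(a_pos, 3))
print("negative case: r =", round(r_neg, 3), " a =", round(a_neg, 3))
```

The correlation and the constant $$a$$ always share the same sign because both are computed from the same sum of cross-products; only the normalization differs.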

Principal Component Analysis (PCA) searches for components within your data that are uncorrelated with all other components. In other words, PCA transforms the original variables in your data into new variables (i.e. the components). The original variables are correlated with each other, while the components are uncorrelated.
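To see this transformation at work on the smallest possible case, here is a hedged sketch of PCA for exactly two variables in plain Python. It assumes PCA is computed from the sample covariance matrix and uses the closed-form solution that exists only for the 2x2 case; the function name `pca_2d` is hypothetical and the toy data are made up. For more than two variables you would normally rely on a linear algebra library instead.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(u, v):
    """Sample covariance of two equally long lists (divides by n - 1)."""
    mu, mv = mean(u), mean(v)
    return sum((p - mu) * (q - mv) for p, q in zip(u, v)) / (len(u) - 1)

def correlation(u, v):
    return covariance(u, v) / math.sqrt(covariance(u, u) * covariance(v, v))

def pca_2d(x1, x2):
    """PCA of two variables using the closed form for a 2x2 covariance matrix.

    Returns the component variances, the two unit-length axes, and the
    transformed data (the components).
    """
    a = covariance(x1, x1)
    b = covariance(x1, x2)
    d = covariance(x2, x2)
    # Eigenvalues of the symmetric matrix [[a, b], [b, d]], largest first.
    half_trace = (a + d) / 2
    radius = math.hypot((a - d) / 2, b)
    variances = (half_trace + radius, half_trace - radius)
    # Direction of largest variation; the second axis is perpendicular to it.
    theta = 0.5 * math.atan2(2 * b, a - d)
    w1 = (math.cos(theta), math.sin(theta))
    w2 = (-math.sin(theta), math.cos(theta))
    # Project the mean-centered data onto the two axes.
    m1, m2 = mean(x1), mean(x2)
    c1 = [w1[0] * (p - m1) + w1[1] * (q - m2) for p, q in zip(x1, x2)]
    c2 = [w2[0] * (p - m1) + w2[1] * (q - m2) for p, q in zip(x1, x2)]
    return variances, (w1, w2), (c1, c2)

# Made-up toy data: two variables that rise and fall together.
x1 = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]
x2 = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]

variances, axes, (c1, c2) = pca_2d(x1, x2)
share_first = variances[0] / (variances[0] + variances[1])

print("correlation of original variables:", round(correlation(x1, x2), 3))
print("correlation of components:", round(correlation(c1, c2), 3))
print("share of variation in first component:", round(share_first, 3))
```

In this sketch the original variables are strongly correlated, while the two components are uncorrelated up to rounding error, and the first component alone carries most of the total variation.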

The key to PCA is the fact that some of the variables in your data are highly correlated with each other. If your variables are uncorrelated or have very low correlation with each other, PCA as a preprocessing step is not necessary. Performing PCA again on the uncorrelated data produced by PCA would not change the result.

In summary, Principal Component Analysis (PCA) will provide you with

• Feature extraction
• Dimensional reduction of your data that still captures the majority of information of your data
• Components that are mutually uncorrelated, along axes that are orthogonal and have unit length (i.e. orthonormal basis vectors)
• Components that are linear combinations of the original variables. PCA determines the weights that maximize the variation, which minimizes the information loss for a given number of components.
• The percentage contribution of each component toward the variation of your data
• Ability to preprocess your data from highly correlated into uncorrelated data
• Ability to manage multi-dimensional data and visualize it in a 2D or 3D chart
• Identification of underlying variables
• Ability to find the interesting pattern within your data

In the next sections, we will discuss the mechanics of PCA, first qualitatively and then quantitatively with numerical examples.