By Pablo Martin, Artelnics.
Principal components analysis (PCA) is a statistical technique that identifies underlying linear patterns in a data set so that it can be expressed in terms of another data set of significantly lower dimension without much loss of information.
The final data set should explain most of the variance of the original data set while using fewer variables. These final variables are called principal components.
The following image depicts the activity diagram that shows each step of the principal component analysis that will be explained in detail later.
To illustrate the process described in the previous diagram, we use the following two-dimensional data set.
Instance | x_{1} | x_{2} | Instance | x_{1} | x_{2}
---|---|---|---|---|---
1 | 0.3 | 0.5 | 11 | 0.6 | 0.8
2 | 0.4 | 0.3 | 12 | 0.4 | 0.6
3 | 0.7 | 0.4 | 13 | 0.3 | 0.4
4 | 0.5 | 0.7 | 14 | 0.6 | 0.5
5 | 0.3 | 0.2 | 15 | 0.8 | 0.5
6 | 0.9 | 0.8 | 16 | 0.8 | 0.9
7 | 0.1 | 0.2 | 17 | 0.2 | 0.3
8 | 0.2 | 0.5 | 18 | 0.7 | 0.7
9 | 0.6 | 0.9 | 19 | 0.5 | 0.5
10 | 0.2 | 0.2 | 20 | 0.6 | 0.4
The following plot shows the values of the variable x_{1} against the values of the variable x_{2}.
The objective is to calculate the principal components and use them to convert this data set into a one-dimensional data set with minimal loss of information.
The first step in the principal component analysis is to subtract the mean for each variable of the data set, which is shown in the next chart for our example.
As we can see, subtracting the mean translates the data, which now have zero mean.
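As a minimal sketch of this step, assuming NumPy is available, the snippet below enters the table as an array and subtracts the mean of each column. The names `X`, `means`, and `X_centered` are illustrative choices, not part of the original example.

```python
import numpy as np

# The 20 instances from the table, one (x1, x2) pair per row.
x1 = [0.3, 0.4, 0.7, 0.5, 0.3, 0.9, 0.1, 0.2, 0.6, 0.2,
      0.6, 0.4, 0.3, 0.6, 0.8, 0.8, 0.2, 0.7, 0.5, 0.6]
x2 = [0.5, 0.3, 0.4, 0.7, 0.2, 0.8, 0.2, 0.5, 0.9, 0.2,
      0.8, 0.6, 0.4, 0.5, 0.5, 0.9, 0.3, 0.7, 0.5, 0.4]
X = np.column_stack([x1, x2])  # shape (20, 2)

# Mean of each variable (column), then the translation to zero mean.
means = X.mean(axis=0)
X_centered = X - means

print("means:", means)
print("means after centering:", X_centered.mean(axis=0))
```

Centering only shifts the cloud of points; the distances between instances are unchanged.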
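The covariance of two random variables measures how they vary together around their respective means. The sign of the covariance tells us about the relation between them: a positive covariance means that the two variables tend to increase and decrease together, while a negative covariance means that one tends to decrease when the other increases.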
These values will determine the linear dependencies between the variables which will be used to reduce the dimension of the data set. Back to our example, the covariance matrix is shown next.
The diagonal values are the covariance of each variable with itself, which equals its variance. The variance measures how spread out the data are around the mean. The off-diagonal values are the covariance between the two variables. In this case, these values are positive, which means that both variables increase and decrease together.
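The covariance matrix for the example can be computed along the following lines. This is a sketch that assumes the sample covariance (dividing by n - 1); the text does not say which denominator was used, so the exact figures may differ slightly from those in the original image.

```python
import numpy as np

# The 20 instances from the table, one (x1, x2) pair per row.
x1 = [0.3, 0.4, 0.7, 0.5, 0.3, 0.9, 0.1, 0.2, 0.6, 0.2,
      0.6, 0.4, 0.3, 0.6, 0.8, 0.8, 0.2, 0.7, 0.5, 0.6]
x2 = [0.5, 0.3, 0.4, 0.7, 0.2, 0.8, 0.2, 0.5, 0.9, 0.2,
      0.8, 0.6, 0.4, 0.5, 0.5, 0.9, 0.3, 0.7, 0.5, 0.4]
X = np.column_stack([x1, x2])

X_centered = X - X.mean(axis=0)
n = X.shape[0]

# Sample covariance matrix (assumption: n - 1 in the denominator).
cov = X_centered.T @ X_centered / (n - 1)

# np.cov with rowvar=False treats columns as variables and gives the same matrix.
cov_check = np.cov(X, rowvar=False)

print(cov)
```

The result is a symmetric 2x2 matrix: variances on the diagonal, and the same positive covariance in both off-diagonal entries.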
The eigenvectors of a linear transformation are the nonzero vectors whose direction remains unchanged when that transformation is applied to them. Their length, however, may change: the result of the transformation is the vector multiplied by a scalar. This scalar is called the eigenvalue, and each eigenvector has one associated with it.
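In symbols, and using notation that does not appear in the original (A for the matrix of the transformation, v for an eigenvector, \lambda for its eigenvalue), this relation can be written as:

$$
A v = \lambda v, \qquad v \neq 0
$$

In PCA, the matrix A is the covariance matrix of the data.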
The number of eigenvectors or components that we can calculate for each data set is equal to the dimension of the data set. In this case, we have a 2-dimensional data set so the number of eigenvectors will be 2. The next image represents the eigenvectors for our example.
Since they are calculated from the covariance matrix described before, the eigenvectors represent the directions along which the data vary the most, and their respective eigenvalues give the amount of variance that the data set has in each of those directions.
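One way to obtain the eigenvectors and eigenvalues for the example is sketched below. It assumes NumPy and uses `np.linalg.eigh`, which is intended for symmetric matrices such as a covariance matrix; the original article does not say which routine was used.

```python
import numpy as np

# The 20 instances from the table, one (x1, x2) pair per row.
x1 = [0.3, 0.4, 0.7, 0.5, 0.3, 0.9, 0.1, 0.2, 0.6, 0.2,
      0.6, 0.4, 0.3, 0.6, 0.8, 0.8, 0.2, 0.7, 0.5, 0.6]
x2 = [0.5, 0.3, 0.4, 0.7, 0.2, 0.8, 0.2, 0.5, 0.9, 0.2,
      0.8, 0.6, 0.4, 0.5, 0.5, 0.9, 0.3, 0.7, 0.5, 0.4]
X = np.column_stack([x1, x2])

cov = np.cov(X, rowvar=False)  # sample covariance, columns are variables

# eigh returns eigenvalues in ascending order and eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Reorder so that the direction with the largest variance comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

print("eigenvalues:", eigenvalues)
print("eigenvectors (as columns):")
print(eigenvectors)
```

The sign of each eigenvector is arbitrary, so a result that points the opposite way describes the same direction.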
Once we have obtained these new directions, we can plot the data in terms of them as shown in the next image for our example.
Note that the data have not changed; we are just rewriting them in terms of these new directions instead of the previous x_{1}-x_{2} directions.
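The change of coordinates can be sketched as a matrix product of the centered data with the eigenvectors. The snippet below, which assumes NumPy and uses illustrative names such as `scores`, also checks that nothing is lost at this stage: multiplying back by the transposed eigenvector matrix returns the centered data.

```python
import numpy as np

# The 20 instances from the table, one (x1, x2) pair per row.
x1 = [0.3, 0.4, 0.7, 0.5, 0.3, 0.9, 0.1, 0.2, 0.6, 0.2,
      0.6, 0.4, 0.3, 0.6, 0.8, 0.8, 0.2, 0.7, 0.5, 0.6]
x2 = [0.5, 0.3, 0.4, 0.7, 0.2, 0.8, 0.2, 0.5, 0.9, 0.2,
      0.8, 0.6, 0.4, 0.5, 0.5, 0.9, 0.3, 0.7, 0.5, 0.4]
X = np.column_stack([x1, x2])

X_centered = X - X.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X, rowvar=False))

# Coordinates of every instance along the two eigenvector directions.
scores = X_centered @ eigenvectors

# The eigenvector matrix is orthogonal, so its transpose undoes the rotation.
recovered = scores @ eigenvectors.T

print("same data after rotating back:", np.allclose(recovered, X_centered))
print("variance along each direction:", scores.var(axis=0, ddof=1))
```

In the new coordinates, the variance along each axis equals the corresponding eigenvalue and the two coordinates are uncorrelated.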
Among all the eigenvectors calculated in the previous step, we must select those onto which we will project the data. The selected eigenvectors are called principal components.
To establish a criterion for selecting the eigenvectors, we must first define the relative variance of each eigenvector and the total variance of a data set. The relative variance of an eigenvector measures how much information can be attributed to it. The total variance of a data set is the sum of the variances of all its variables.
These two concepts are determined by the eigenvalues. For our example, the next table shows the relative and the cumulative variance for each eigenvector.
As we can see, the first eigenvector explains almost 85% of the variance of the data, while the second eigenvector explains around 15% of it. The next graph shows the cumulative variance for the components.
A common way to select the components is to establish the amount of information that we want the final data set to explain. If this amount of information decreases, the number of principal components that we select will decrease as well. In this case, as we want to reduce the 2-dimensional data set to a 1-dimensional data set, we select just the first eigenvector as the principal component. As a consequence, the final reduced data set will explain around 85% of the variance of the original one.
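The selection criterion can be sketched as follows, assuming NumPy. The relative variance of each eigenvector is taken as its eigenvalue divided by the sum of all eigenvalues. The threshold `target = 0.80` is a hypothetical value chosen for illustration; the article does not state one.

```python
import numpy as np

# The 20 instances from the table, one (x1, x2) pair per row.
x1 = [0.3, 0.4, 0.7, 0.5, 0.3, 0.9, 0.1, 0.2, 0.6, 0.2,
      0.6, 0.4, 0.3, 0.6, 0.8, 0.8, 0.2, 0.7, 0.5, 0.6]
x2 = [0.5, 0.3, 0.4, 0.7, 0.2, 0.8, 0.2, 0.5, 0.9, 0.2,
      0.8, 0.6, 0.4, 0.5, 0.5, 0.9, 0.3, 0.7, 0.5, 0.4]
X = np.column_stack([x1, x2])

cov = np.cov(X, rowvar=False)

# eigvalsh returns ascending eigenvalues; reverse to put the largest first.
eigenvalues = np.linalg.eigvalsh(cov)[::-1]

# Share of the total variance carried by each eigenvector, and the running sum.
relative = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(relative)

# Hypothetical threshold: keep the fewest components that reach it.
target = 0.80
n_components = int(np.searchsorted(cumulative, target) + 1)

print("relative variance:", relative)
print("cumulative variance:", cumulative)
print("components kept:", n_components)
```

With this threshold a single component is kept. A threshold above the first component's share would keep both, and no dimension would be dropped.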
Once we have selected the principal components, the data must be projected onto them. The next image shows the result of this projection for our example.
Although this projection can explain most of the variance of the original data, we have lost the information about the variance along the second component. In general, this process is irreversible, which means that we cannot recover the original data from the projection.
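The projection, and the information it discards, can be sketched as below. This assumes NumPy; `z`, `X_approx`, and `lost` are illustrative names. Mapping the one-dimensional values back into the original space gives points that all lie on a single line, so the original data are only approximated.

```python
import numpy as np

# The 20 instances from the table, one (x1, x2) pair per row.
x1 = [0.3, 0.4, 0.7, 0.5, 0.3, 0.9, 0.1, 0.2, 0.6, 0.2,
      0.6, 0.4, 0.3, 0.6, 0.8, 0.8, 0.2, 0.7, 0.5, 0.6]
x2 = [0.5, 0.3, 0.4, 0.7, 0.2, 0.8, 0.2, 0.5, 0.9, 0.2,
      0.8, 0.6, 0.4, 0.5, 0.5, 0.9, 0.3, 0.7, 0.5, 0.4]
X = np.column_stack([x1, x2])

mean = X.mean(axis=0)
X_centered = X - mean
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X, rowvar=False))

# eigh sorts ascending, so the last column is the first principal component.
pc1 = eigenvectors[:, -1]

# Reduced data set: one value per instance.
z = X_centered @ pc1

# Best reconstruction available from z alone: points on the line through pc1.
X_approx = mean + np.outer(z, pc1)

# Fraction of the total variance that the reconstruction fails to capture.
residual = X - X_approx
lost = (residual ** 2).sum() / (X_centered ** 2).sum()

print("reduced data:", z)
print("fraction of variance lost:", lost)
```

The fraction lost should match the relative variance of the second eigenvector, roughly the 15% quoted earlier.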
Principal components analysis is a technique that allows us to identify the underlying dependencies of a data set and to significantly reduce its dimensionality based on them.
This technique is very useful for processing data sets with hundreds of variables while maintaining, at the same time, most of the information from the original data set.
Principal components analysis can also be implemented within a neural network. However, since this process is irreversible, the reduction should be applied only to the inputs and not to the target variables.