17.7.1.3 Algorithms (Principal Component Analysis)


Principal Component Analysis examines relationships among variables. It can be used, for example, to reduce the number of variables in regression and clustering.

Each principal component is a linear combination of the variables that maximizes variance. Let X be an n by p matrix of n observations on p variables, and let S be its covariance matrix. Then for a linear combination of the variables

z_1=\sum_{i=1}^p a_{1i}x_i

where x_i is the ith variable and a_{1i}, i=1,2,...,p, are the combination coefficients for z_1. The coefficients can be written as a column vector a_1, normalized so that a_1^Ta_1=1. The variance of z_1 is then a_1^TSa_1.

The vector a_1 is found by maximizing this variance, and z_1 is called the first principal component. The second principal component is found in the same way, by maximizing:

a_2^TSa_2 subject to the constraints a_2^Ta_2=1 and a_2^Ta_1=0

This gives a second principal component that is orthogonal to the first. The remaining principal components are derived in the same way. In fact, the coefficients a_1, a_2, ..., a_p can be calculated from the eigenvectors of the matrix S. Origin uses different methods depending on how missing values are excluded.
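
The two views coincide: the unit vector a_1 that maximizes a_1^TSa_1 is the eigenvector of S with the largest eigenvalue. A minimal numpy sketch of this fact (illustrative only; the data are made up and the code is not Origin's):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * [3.0, 1.0, 0.2]  # 100 observations of 3 variables
S = np.cov(X, rowvar=False)                      # the p x p covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)             # eigenvalues in ascending order
a1 = eigvecs[:, -1]                              # coefficients a_1 of the first PC
print(np.isclose(a1 @ S @ a1, eigvals[-1]))      # True: max variance = largest eigenvalue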

Listwise Exclusion of Missing Values

An observation containing one or more missing values is excluded from the analysis. A matrix X_s for singular value decomposition (SVD) is then derived from X according to the matrix type chosen for the analysis.

Matrix Type for Analysis

  • Covariance Matrix
Let X_s be the matrix X with each variable's mean subtracted from its column and each column scaled by \frac{1}{\sqrt{n-1}}.
  • Correlation Matrix
Let X_s be the matrix X with each variable's mean subtracted from its column and each column scaled by \frac{1}{\sqrt{n-1}\sigma_i}, where \sigma_i is the standard deviation of the ith variable. Both constructions are sketched in code after this list.
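
A hypothetical helper illustrating the two X_s constructions above; the name make_Xs and its interface are mine, not Origin's:

import numpy as np

def make_Xs(X, matrix_type="covariance"):
    """Build X_s for SVD from a complete-case data matrix X (n by p)."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)                # subtract each column's mean
    if matrix_type == "covariance":
        return Xc / np.sqrt(n - 1)
    if matrix_type == "correlation":
        sigma = X.std(axis=0, ddof=1)      # standard deviation of each variable
        return Xc / (np.sqrt(n - 1) * sigma)
    raise ValueError(matrix_type)

With this scaling, X_s^TX_s reproduces the covariance (or correlation) matrix S, which is why the SVD of X_s yields S's eigenvalues and eigenvectors directly.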

Quantities to Compute

Perform SVD on X_s:

X_s=V\Lambda P^T

where V is an n by p matrix with V^TV=I, P is a p by p matrix, and \Lambda is a diagonal matrix with diagonal elements s_i, i=1, 2, ..., p.

  • Eigenvalues
\lambda_i=s_i^2
Eigenvalues are sorted in descending order. The proportion of variance explained by the ith principal component is \lambda_i/\sum_{k=1}^p \lambda_k.
  • Eigenvectors
Eigenvectors are also known as loadings, or coefficients, for the principal components. Each column of P is the eigenvector corresponding to one eigenvalue, i.e., to one principal component.
Note that the eigenvector's sign is not unique in SVD; Origin normalizes the sign by forcing the sum of each column to be positive.
  • Scores
Each column of \sqrt{n-1}V\Lambda holds the scores for the corresponding principal component. Scores are set to missing values for any observation that contains missing values.
Note that the variance of the scores for each principal component equals its corresponding eigenvalue for this method.
  • Standardized Scores
Scores for each principal component are standardized so that they have unit variance. All four quantities are sketched in code after this list.
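
A minimal sketch of the whole listwise route for the covariance-matrix case (illustrative data; not Origin's implementation):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))                        # complete cases only
n, p = X.shape

Xs = (X - X.mean(axis=0)) / np.sqrt(n - 1)
V, s, PT = np.linalg.svd(Xs, full_matrices=False)   # Xs = V @ diag(s) @ PT
P = PT.T

eigvals = s**2                                      # already in descending order
proportion = eigvals / eigvals.sum()                # variance explained per component

sign = np.where(P.sum(axis=0) < 0, -1.0, 1.0)       # force each column sum positive
P = P * sign
V = V * sign                                        # flip V too, so Xs = V diag(s) P^T still holds

scores = np.sqrt(n - 1) * V * s                     # one column of scores per component
std_scores = scores / np.sqrt(eigvals)              # standardized to unit variance
print(np.allclose(scores.var(axis=0, ddof=1), eigvals))  # True: variance equals eigenvalue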

Pairwise Exclusion of Missing Values

An observation is excluded from the calculation of the covariance or correlation between two variables only if it has a missing value in either of the two variables.

Eigenvalues and eigenvectors are calculated from the covariance or correlation matrix S.

SP=PD

where P is a p by p matrix and D is a diagonal matrix with diagonal elements \lambda_i, i=1, 2, ..., p.

  • Eigenvalues
\lambda_i is the eigenvalue for the ith principal component. Eigenvalues are sorted in descending order.
Note that with pairwise exclusion S is not guaranteed to be positive semidefinite, so eigenvalues can be negative, which makes no sense for principal components. Origin sets the loadings and scores to zero for a negative eigenvalue.
  • Eigenvectors
Each column of P is the eigenvector corresponding to one eigenvalue, i.e., to one principal component.
Note that the eigenvector's sign is not unique; Origin normalizes its sign by forcing the sum of each column to be positive.
  • Scores
V=X_0P
where X_0 is the matrix X with each variable's mean subtracted from its column.
Scores are set to missing values for any observation that contains missing values.
Note that the variance of the scores for each principal component may not equal its corresponding eigenvalue for this method.
  • Standardized Scores
Scores for each principal component are scaled by the square root of the corresponding eigenvalue. The pairwise route is sketched in code after this list.
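
An illustrative sketch of the pairwise route, with np.nan marking missing values. Pandas' pairwise-complete covariance stands in for Origin's computation, and reading "scaled by" as division is my assumption:

import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))
X[::7, 1] = np.nan                                   # inject some missing values

S = pd.DataFrame(X).cov().to_numpy()                 # pairwise-complete covariance
lam, P = np.linalg.eigh(S)
order = np.argsort(lam)[::-1]                        # sort eigenvalues in descending order
lam, P = lam[order], P[:, order]

P[:, lam < 0] = 0.0                                  # zero loadings for negative eigenvalues
sign = np.where(P.sum(axis=0) < 0, -1.0, 1.0)        # force each column sum positive
P = P * sign

X0 = X - np.nanmean(X, axis=0)                       # subtract each column's mean
scores = X0 @ P                                      # rows containing NaN yield NaN scores
std_scores = scores / np.sqrt(np.where(lam > 0, lam, np.inf))  # assumes "scaled by" means division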

Bartlett's Test

Bartlett's test assesses the equality of the remaining p-k eigenvalues. It is available only when the analysis matrix is the covariance matrix.

H_0:\lambda_{k+1}=\lambda_{k+2}=...=\lambda_p, \quad k=0, 1, ..., p-2

The test statistic below approximately follows a \chi^2 distribution with \frac{1}{2}(p-k-1)(p-k+2) degrees of freedom:

(n-1-(2p+5)/6)\Big\{-\sum_{i=k+1}^p \log(\lambda_i)+(p-k)\log\Big(\sum_{i=k+1}^p \lambda_i/(p-k)\Big)\Big\}
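
A direct transcription of this statistic; scipy's chi-square survival function supplies the p-value, and the function name is mine:

import numpy as np
from scipy.stats import chi2

def bartlett_pca(lam, n, k):
    """Test H0: the last p-k eigenvalues are equal (lam sorted descending)."""
    lam = np.asarray(lam, dtype=float)
    p = len(lam)
    rest = lam[k:]                                  # the remaining p - k eigenvalues
    stat = (n - 1 - (2 * p + 5) / 6) * (
        (p - k) * np.log(rest.sum() / (p - k)) - np.sum(np.log(rest))
    )
    df = (p - k - 1) * (p - k + 2) / 2              # degrees of freedom
    return stat, chi2.sf(stat, df)                  # statistic and p-value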