Principal Components Analysis
Principal component analysis (PCA) is a technique for analyzing and simplifying data sets. It uses dimensionality reduction to transform many indicators into a smaller number of composite indicators. PCA is a multivariate statistical method that analyzes the distribution of the data through its eigenvalues and eigenvectors.
PCA was proposed by Karl Pearson in 1901 and was originally used to analyze data and establish mathematical models. Its main step is an eigendecomposition of the covariance matrix of the data, which yields the principal components (the eigenvectors) and their weights (the corresponding eigenvalues).
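The eigendecomposition mentioned above can be written out as follows. This is a sketch under assumed notation: the symbols \mu, C, w_k and \lambda_k are introduced here for illustration, and the 1/(m-1) normalization is one common convention (1/m is also used).

```latex
\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}, \qquad
C = \frac{1}{m-1}\sum_{i=1}^{m} \bigl(x^{(i)}-\mu\bigr)\bigl(x^{(i)}-\mu\bigr)^{\top}, \qquad
C\,w_k = \lambda_k\,w_k, \quad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n \ge 0 .
```

Here each eigenvector w_k is a principal component direction, and its eigenvalue \lambda_k is the variance of the data along that direction.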
Implementation of the PCA Algorithm
In practice, PCA centers the data and then represents it using only its most important directions of variation. For example, suppose the data set contains m samples (x(1), x(2), …, x(m)), each of dimension n. To reduce the dimension from n to n' (with n' < n), PCA produces m samples of dimension n' that replace the original data set while keeping the loss of information as small as possible.
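The procedure can be sketched with NumPy as below. This is a minimal illustration, not a reference implementation: the function name `pca_fit_transform`, the random test data, and the choice of the m-1 covariance normalization are all assumptions made for the example.

```python
import numpy as np

def pca_fit_transform(X, n_components):
    """Reduce X (m samples x n features, n >= 2) to n_components dimensions."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)                    # center of the data
    Xc = X - mean                            # centered data
    cov = np.cov(Xc, rowvar=False)           # n x n covariance, normalized by m-1
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]  # indices of the largest eigenvalues
    components = eigvecs[:, order]           # n x n' matrix of principal directions
    Z = Xc @ components                      # m x n' reduced data
    return Z, components, mean, eigvals[order]

# Hypothetical data: 200 samples in 3 dimensions with unequal spread
rng = np.random.default_rng(0)
A = np.array([[3.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.2, 0.1]])
X = rng.normal(size=(200, 3)) @ A

Z, W, mu, variances = pca_fit_transform(X, n_components=2)
print(Z.shape)        # reduced data: one row per sample, n' columns
print(variances)      # variance captured by each retained component
```

`np.linalg.eigh` is used rather than `np.linalg.eig` because the covariance matrix is symmetric, so the eigenvalues are real and the eigenvectors orthonormal.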
Applications of PCA
- Exploratory Data Analysis
- Data preprocessing and dimensionality reduction
- Data compression and reconstruction
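As an illustration of the compression and reconstruction use case, the sketch below projects data onto the top k components and maps it back. The synthetic data, the variable names, and the choice k = 2 are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: 100 samples, 5 features that mostly vary along 2 directions
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.01 * rng.normal(size=(100, 5))

mean = X.mean(axis=0)
Xc = X - mean
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]            # sort components by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
W = eigvecs[:, :k]
Z = Xc @ W                  # compressed representation: 100 x 2
X_hat = Z @ W.T + mean      # reconstruction in the original 5-D space

explained = eigvals[:k].sum() / eigvals.sum()
mse = np.mean((X - X_hat) ** 2)
print(f"explained variance ratio: {explained:.4f}")
print(f"reconstruction MSE: {mse:.6f}")
```

The total squared reconstruction error equals (m-1) times the sum of the discarded eigenvalues, so the eigenvalue spectrum indicates in advance how much is lost for a given k.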
Advantages and Disadvantages of PCA Algorithm
The advantages of the PCA algorithm are:
- Information is measured by variance alone, so the result depends only on the data set itself and not on outside factors;
- The principal components are mutually orthogonal, which removes the correlation between the original data components;
- The computation is simple: the main operation is an eigenvalue decomposition, which is easy to implement.
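The orthogonality point can be checked numerically. The sketch below, using made-up correlated 2-D data, shows that the original features are strongly correlated while the projected components have a diagonal covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data: the second feature is largely a scaled copy of the first
x = rng.normal(size=500)
X = np.column_stack([x, 0.8 * x + 0.3 * rng.normal(size=500)])

Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
Z = Xc @ eigvecs            # data expressed in the principal component basis

corr_X = np.corrcoef(X, rowvar=False)   # correlation of the original features
cov_Z = np.cov(Z, rowvar=False)         # covariance of the principal components

print(np.round(corr_X, 3))
print(np.round(cov_Z, 6))   # off-diagonal entries vanish up to rounding error
```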
The disadvantages of the PCA algorithm are:
- The meaning of each principal component dimension is less clear and harder to interpret than that of the original sample features;
- Components with small variance may still contain important information about differences between samples, so discarding them during dimensionality reduction may harm subsequent data processing.
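The second disadvantage can be made concrete with a constructed example. In the sketch below, the data, the labels, and the helper `class_gap` are all hypothetical: one feature has large variance but carries no class information, while the class difference sits entirely in a low-variance feature.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
labels = np.repeat([0, 1], n)

# Feature 0: large variance, identical distribution for both classes
f0 = rng.normal(scale=10.0, size=2 * n)
# Feature 1: small variance, but the class means differ (-1 vs +1)
f1 = rng.normal(scale=0.2, size=2 * n) + np.where(labels == 1, 1.0, -1.0)
X = np.column_stack([f0, f1])

Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]        # first column = largest variance
Z = Xc @ eigvecs[:, order]

def class_gap(z):
    """Distance between class means, in units of the overall standard deviation."""
    return abs(z[labels == 1].mean() - z[labels == 0].mean()) / z.std()

print(f"class gap on component 1 (kept):      {class_gap(Z[:, 0]):.2f}")
print(f"class gap on component 2 (discarded): {class_gap(Z[:, 1]):.2f}")
```

Keeping only the first component here would retain almost all of the variance while discarding nearly all of the information that separates the two classes.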