
Feature Selection

Feature selection is the process of isolating the most consistent, non-redundant, and relevant subset of features for model building. As datasets continue to grow in size and variety, it becomes increasingly important to reduce their dimensionality methodically. The main goals of feature selection are to improve the performance of predictive models and to reduce the computational cost of modeling.

Example Uses of Feature Selection

Feature selection is an effective preprocessing technique for various practical applications, such as text classification, remote sensing, image retrieval, microarray analysis, mass spectrometry, and sequence analysis.

Here are some real-life examples of feature selection:

  1. Mammographic image analysis
  2. Criminal behavior modeling
  3. Genomic data analysis
  4. Platform monitoring
  5. Mechanical integrity assessment
  6. Text clustering
  7. Hyperspectral image classification
  8. Sequence analysis

Importance of Feature Selection

Feature selection makes the machine learning process more efficient and accurate: by keeping the most critical variables and eliminating redundant and irrelevant ones, it improves the predictive power of the resulting model. This is why feature selection is important.

The three main benefits of feature selection are:

  1. Reduce overfitting  
    Less redundant data means less opportunity for the model to make decisions based on noise.
  2. Improve accuracy  
    Less misleading data means greater modeling accuracy.
  3. Reduce training time  
    Less data means faster algorithms.

Feature Selection Methods

Feature selection algorithms are divided into supervised and unsupervised: supervised methods can be used for labeled data, unsupervised methods for unlabeled data. Supervised techniques are further divided into filter methods, wrapper methods, embedded methods, or hybrid methods:

  • Filter Method: Filter methods select features based on statistical measures rather than on cross-validated model performance. A chosen metric is applied to score attributes and discard irrelevant ones. Filter methods can be univariate, where a ranked list of individually scored features informs the final selection of a feature subset, or multivariate, where the relevance of the feature set as a whole is evaluated to identify redundant and irrelevant features. A minimal univariate sketch appears after this list.
  • Wrapper Method: Wrapper feature selection methods treat the choice of a feature subset as a search problem, assessing quality by preparing, evaluating, and comparing one combination of features against others. Because candidate subsets are evaluated together, this approach can detect interactions between variables. Wrapper methods search for the subset of features that most improves the results of the learning algorithm used for selection. Popular examples include Boruta feature selection and forward feature selection; a forward-selection sketch follows this list.
  • Embedded Method: Embedded feature selection methods build feature selection into the learning algorithm itself, so that training and feature selection are performed simultaneously: the features that contribute most to the model are retained as training proceeds. Random forest, decision tree, and LASSO feature selection are common embedded methods; a LASSO sketch follows this list.
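
To make the univariate filter approach concrete, here is a minimal sketch using scikit-learn's SelectKBest with an ANOVA F-test; the synthetic dataset and the choice of k = 5 are assumptions for illustration only:

```python
# Univariate filter: score each feature independently with the ANOVA
# F-statistic, then keep the k highest-scoring features.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data (assumed for illustration): 20 features, 5 informative.
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

selector = SelectKBest(score_func=f_classif, k=5)
X_reduced = selector.fit_transform(X, y)

print(X_reduced.shape)                     # (500, 5)
print(selector.get_support(indices=True))  # indices of the kept features
```

Note that the features are scored without ever training the downstream model, which is what makes filter methods so cheap.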
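A forward-selection wrapper can be sketched with scikit-learn's SequentialFeatureSelector; the k-nearest-neighbors estimator and the target subset size are arbitrary choices for this example:

```python
# Wrapper method: greedy forward selection. Starting from an empty set,
# repeatedly add the feature whose inclusion most improves the
# cross-validated score of the wrapped estimator.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

sfs = SequentialFeatureSelector(KNeighborsClassifier(),
                                n_features_to_select=5,
                                direction="forward", cv=5)
sfs.fit(X, y)
print(sfs.get_support(indices=True))  # indices of the selected features
```

Because every candidate subset is evaluated by actually fitting the estimator, wrappers can capture feature interactions, but they cost far more compute than filters.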
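For the embedded family, LASSO is the classic example: its L1 penalty shrinks the coefficients of uninformative features to exactly zero as a side effect of training. A minimal sketch, assuming synthetic regression data:

```python
# Embedded method: L1 (LASSO) regularization performs selection during
# training by driving the weights of uninformative features to zero.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

X, y = make_regression(n_samples=500, n_features=20,
                       n_informative=5, noise=10.0, random_state=0)

lasso = LassoCV(cv=5).fit(X, y)         # picks the penalty strength by CV
selected = np.flatnonzero(lasso.coef_)  # features with non-zero weight
print(selected)
```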
