HyperAI

Similarity Measure

A similarity measure (also called a similarity metric) estimates how similar different samples are and is often used as a criterion in classification problems. In machine learning and data mining, it is often necessary to quantify how much individuals differ in order to judge how similar they are and which category each belongs to.

Currently, the most common applications are correlation analysis in data analysis, and classification and clustering algorithms in data mining, such as the K-nearest neighbor algorithm (KNN) and K-means. Different measures can be chosen according to the characteristics of the data.

Distance and similarity metrics

  • Distance Measure: Measures the distance between individuals in space. The greater the distance, the greater the difference between individuals.
  • Similarity Measure: Measures how similar individuals are. The smaller the value of the similarity measure, the less similar the individuals are and the greater the difference between them.
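
The opposite directions of the two kinds of measure can be illustrated with a minimal Python sketch. Euclidean distance is used here only as a familiar example of a distance measure, and the helper distance_to_similarity with its 1 / (1 + d) mapping is a hypothetical, illustrative choice rather than a standard definition.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_to_similarity(d):
    """Map a distance in [0, inf) to a similarity in (0, 1].

    Illustrative assumption: 1 / (1 + d) is just one simple way to make
    a larger distance yield a smaller similarity.
    """
    return 1.0 / (1.0 + d)


near = euclidean_distance([1.0, 2.0], [1.0, 3.0])  # points close together
far = euclidean_distance([1.0, 2.0], [4.0, 6.0])   # points far apart

print(near, distance_to_similarity(near))
print(far, distance_to_similarity(far))
```

The pair with the larger distance receives the smaller similarity score, which matches the two definitions above.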

Commonly used similarity measurement methods

  • Cosine Similarity in Vector Space: Uses the cosine of the angle between two vectors to measure how similar two individuals are. Compared with a distance metric, it emphasizes the difference in direction between the two vectors rather than their distance or length.
  • Pearson Correlation Coefficient: The correlation coefficient r used in correlation analysis. It is obtained by standardizing X and Y (subtracting each one's mean and dividing by its standard deviation) and then computing the cosine of the angle between the resulting vectors.
  • Jaccard Coefficient: Mainly used to measure the similarity between individuals described by symbolic or Boolean attributes. Because such attributes are recorded as symbols or Boolean flags, the size of a difference cannot be measured; the only possible conclusion is whether two values are the same. The Jaccard coefficient therefore considers only the characteristics that individuals have in common, relative to all the characteristics that either of them has.
  • Adjusted Cosine Similarity: Because cosine similarity is insensitive to the magnitude of the values, it can give misleading results. Adjusted cosine similarity corrects for this by subtracting a mean from the values in every dimension before the cosine is computed.
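
The four measures above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the inputs are assumed to be equal-length lists of numbers (or collections of hashable items for the Jaccard coefficient), and adjusted_cosine_similarity assumes the caller supplies one mean per dimension, since the text does not specify which mean is used.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b.

    Raises ZeroDivisionError if either vector has zero length.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pearson_correlation(a, b):
    """Pearson r, computed as the cosine of the mean-centered vectors.

    Dividing by the standard deviation is skipped because rescaling a
    vector does not change the cosine of its angle to another vector.
    """
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a],
                             [y - mean_b for y in b])


def jaccard_coefficient(s, t):
    """Size of the intersection divided by the size of the union."""
    s, t = set(s), set(t)
    if not s and not t:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(s & t) / len(s | t)


def adjusted_cosine_similarity(a, b, means):
    """Cosine similarity after subtracting means[i] from dimension i.

    Assumption: `means` holds one mean per dimension, supplied by the caller.
    """
    return cosine_similarity([x - m for x, m in zip(a, means)],
                             [y - m for y, m in zip(b, means)])


# Two rating vectors on a 1-5 scale: a is low in both dimensions, b is high.
a, b = [1, 2], [4, 5]
print(cosine_similarity(a, b))                   # close to 1 despite opposite opinions
print(adjusted_cosine_similarity(a, b, [3, 3]))  # negative once the mean is removed
print(pearson_correlation([1, 2, 3], [2, 4, 6]))
print(jaccard_coefficient({"a", "b", "c"}, {"b", "c", "d"}))
```

In the rating example, plain cosine similarity treats the two vectors as nearly identical because they point in almost the same direction, while the adjusted version turns negative after the mean of 3 is subtracted, which is the correction described in the last bullet.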
Related words: distance metric