K-means Clustering
k-means clusteringIt is a vector quantization method that was used in signal processing in the early days. It is currently mainly used as a clustering analysis method in the field of data mining.
The purpose of k-means clustering is to divide n points into k clusters so that each point belongs to the cluster corresponding to the nearest mean, and use this as the clustering criterion. This type of problem can be understood as the problem of dividing a data space into Voronoi cells, which is mainly used for clustering two-dimensional data points.
Main steps
1. Select k points as initial mass points;
2. Repeat steps:
- Assign each point to the nearest centroid to form k clusters;
- Recalculate the centroid of each cluster;
3. Until the cluster does not change or the maximum number of iterations is reached.
Distance Metrics and Objective Functions
Consider the Euclidean distance function and use the sum of squared errors as the objective function of clustering.
Pros and Cons
Advantages: The k-means algorithm is a classic algorithm for clustering problems. The algorithm is simple and fast, and has relatively high algorithm efficiency for large amounts of data. Its high scalability is usually used as a local optimal ending algorithm. The clustering effect is better when the clusters are dense, circular, and clumpy, and the differences between clusters are obvious.
Disadvantages: The user must give the number k of clusters to be generated in advance; the algorithm is sensitive to the initial value, and different initial values will lead to different clustering results; it is sensitive to noise data and isolated point data, and a small amount of data will have a huge impact on the average value.
References
【1】k-means algorithm - Wikipedia
【2】http://hometown.group/wp-content/uploads/2018/07/%E8%81%9A%E7%B1%BB.pdf