k-means®: k-means clustering in C++, Python, Ruby, and R.
k-means clustering is a heuristic method of vector quantization, that is popular for cluster analysis in data mining.
The k-means algorithm is a type of unsupervised machine learning algorithm. It is a clustering algorithm that operates on n-dimensional data sets. The algorithm operates by choosing random centers in the data space, then fine-tuning the location of those centers by minimizing the sum of squared distances from the points of the group assigned to a particular center. It is an iterative algorithm, which alternately refines the position of the centers, then reassigns data points to the nearest centre, if necessary. The assignment of a data point to single center result in the formation of clusters.
The k-means algorithm takes the input parameter, k, and partitions a set of n objects into k clusters so that the resulting intracluster similarity is high but the intercluster similarity is low. Cluster similarity is measured in regard to the mean value of the objects in a cluster, which can be viewed as the cluster’s centroid or center of gravity.
Suppose a data set, D, contains n objects in Euclidean space. Partitioning methods distribute the objects in D into k clusters, C1,..., Ck, that is Ci ⊂ D and Ci ∩ Cj = ø for (1≤ i,j ≤ k). An objective function is used to asses the partitioning quality so that objects within a cluster are similar to one another but dissimilar to objects in other clusters. This is the objective function aims for high intracluster similarity and low intercluster similarity.
A centroid-based partitioning technique uses the centroid of a cluster, Ci , to represent that cluster. Conceptually, the centroid of a cluster is its center point. The centroid can be defined in various ways such as by the mean or medoid of the objects (or points) assigned to the cluster. The difference between an object p ∈ Ci and ci, the representative of the cluster, is measured by dist(p, ci), where dist(x,y) is the Euclidean distance between two points x and y.